We hope that this Workshop will be the first of a series of many which will serve to promote

PLENARY & SPECIAL TALKS


Information  Theory  and  Statistics 

Tom  Cover^ 

Departments of Electrical Engineering and Statistics, Stanford University, Durand 121, Stanford, CA 94305-4055, USA,

email:  cover@isl.stanford.edu 


Abstract — The main theorems in information theory
and statistics are put in context, the differences
are discussed, and some of the open research problems
are mentioned.

I.  Introduction 

Probability theory has produced a number of strong general
statements: truths about stochastic processes that give random
processes a deterministic flavor. These successes include
the strong law of large numbers, the central limit theorem, the
law of the iterated logarithm, the ergodic theorem, and limit
theorems for Markov processes.

Information theory, on the other hand, has been primarily
motivated by an attempt to optimize certain processes, for
example, to minimize the description length of random processes
or to maximize the number of distinguishable signals in the
presence of noise. This different orientation (optimization)
has led to a number of additional insights which contribute
to the body of knowledge in probability theory. For example,
the central limit theorem can be proved by way of the
entropy power inequality, yielding a monotonic convergence
to the limit. And the law of large numbers has a counterpart
in the asymptotic equipartition property, which says that all
ergodic stochastic processes can be considered as a uniform
distribution over a small set of typical sequences
characterized by the entropy rate.
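
The asymptotic equipartition property can be checked numerically. The sketch below assumes an i.i.d. Bernoulli source, which is an illustrative choice and not taken from the talk; the function names and parameter values are hypothetical. It computes the per-symbol information -(1/n) log2 p(X^n) of a sampled sequence, which should settle near the entropy as n grows.

```python
import math
import random

def entropy_bits(p):
    """Binary entropy H(p) in bits, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def sample_sequence(n, p, rng):
    """Draw n i.i.d. Bernoulli(p) symbols (assumed source model)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def per_symbol_information(bits, p):
    """Return -(1/n) log2 p(x^n) for an i.i.d. Bernoulli(p) sequence."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * math.log2(p) + zeros * math.log2(1 - p)) / len(bits)

if __name__ == "__main__":
    rng = random.Random(0)
    p = 0.2
    for n in (100, 10_000, 100_000):
        x = sample_sequence(n, p, rng)
        # The two printed numbers should approach each other as n grows.
        print(n, per_symbol_information(x, p), entropy_bits(p))
```

For an ergodic source that is not i.i.d., the same comparison would use the entropy rate in place of H(p); that case is not covered by this sketch.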

II. Specifics

We will demonstrate some of the points of intersection of
information theory and statistics, and mention some problems
in physics and computer science that require a rigorous
probabilistic treatment.

The  discussion  will  include  the  following: 

1.  Chernoff  information,  error  exponents,  large  deviation 
theory. 

2.  The  geometry  of  information. 

3. Structure of ergodic processes, the AEP and the
Slepian-Wolf theorem.

4. The common basis for the Cramér-Rao, entropy power,
Brunn-Minkowski, and Heisenberg uncertainty inequalities.
(See Dembo.)

5.  Entropy  rate  (compressibility  limits),  channel  capacity 
(distinguishability  limits).  The  duality  of  the  two. 

6. The central limit theorem and the entropy power
inequality. (See Barron.)

7. Information loss and the second law. The argument
that entropy will be lost when mass is thrown into a
black hole, together with the even stronger belief that
entropy increases (the second law of thermodynamics),
led Bekenstein and Hawking to argue that the mass of
the black hole (which increases when matter is thrown
into it) is proportional to its entropy (the logarithm of
the number of ways in which it could be made), thus
preserving the second law.

8. Entropy increase. The H theorem in statistical mechanics
shows that entropy increases with time. But the laws
of physics are time reversible. What is going on?

9. Investment processes. Duality with data compression.

^This work was partially supported by NSF Grant NCR-9205663
and JSEP Contract DAAH04-94-G-0058.

III.  Remarks 

Certain theorems from information theory like the
asymptotic equipartition property (the Shannon-McMillan-Breiman
theorem) may deserve to be considered part of the
hard core of probability theory. Yet other results in
information theory like the entropy power inequality turn out to
play an important role in interpreting the central limit
theorem. And finally, some of the tools in information theory
may have important roles to play in physics, just as ergodic
theory, developed in the 1930s, resolved some of the problems
in statistical mechanics.
Abstract — Lossy compression and classification
algorithms both attempt to reduce a large collection of
possible observations into a few representative
categories so as to preserve essential information. In this
talk a recently developed framework for combining
classification and compression into one or two
quantizers is described along with some examples and
related to other quantizer-based classification schemes.

The object of classification is to map an observed vector
into one of a finite collection of indices representing a class
or type of data. For example, one might view a block of pixels
in a digital mammogram and wish to classify the block
as containing a microcalcification or not. The quality of
the classifier is typically measured by its average Bayes risk.
Another important attribute of the classifier is its complexity:
how hard it is to convert the observed vector into the final
decision. The object of vector quantization is to map an
observed vector into one of a finite number of representatives or
templates. Here quality is typically measured by an average
distortion such as squared error. Instead of complexity, the
second parameter of primary importance is typically bit rate,
measured either by the log of the number of templates or by
the entropy of the quantizer output.

Classification and compression can both be viewed as a
quantization operation, mapping a possibly continuous space
into a finite one. The measures of quality differ, but both
Bayes risk and squared error can be viewed as a form of
distortion to be minimized subject to constraints on complexity
or bit rate. Furthermore, bit rate is relevant to classification
if continuous data is quantized prior to digital
classification, and complexity is important for compression to ensure
efficient software or hardware implementation.

Many quantizer-based classifiers have been proposed in the
literature, including classical nearest neighbor and clustered
variations [1, 2, 3]. Perhaps the most famous quantization
approach to classification is Kohonen’s “learning vector
quantizer” (LVQ) [4]. While codebook design differs, all use a
Euclidean nearest neighbor encoder.

An alternative approach is to incorporate explicitly a Bayes
risk term into the average distortion minimized by a quantizer
using a Lagrange multiplier and thereby include both a
distortion term reflecting the general quality of the reproduction
(such as signal-to-noise ratio) and one reflecting the intended
application (such as Bayes risk or classification error). By
weighting these two components one can, in effect, optimize
for general appearance and specific task [5, 6, 7, 8, 9, 10].

Let $q$ be a $k$-dimensional vector quantizer with codebook
$\mathcal{C}$, partition $\mathcal{P}$, encoder $\alpha$, and decoder $\beta$. Let $\delta$ be a
classifier assigning a class label $\delta(i) \in \mathcal{H}$ to each possible encoder
output $i = 1, \ldots, N$, producing an overall classification rule
of $\gamma(x) = \delta(\alpha(x))$. Let $d$ denote a distortion measure such as
squared error.

^Portions of this work were supported by the National Science
Foundation under grants MIP-9311190 and DMS-9101548 and
by the National Institutes of Health under grant 1R01-CA55325.


The compression performance measured by mean squared
error is

$$D(\alpha,\beta) = \sum_{i=1}^{N} E[d(X,\beta(i)) \mid \alpha(X) = i] \, \Pr(\alpha(X) = i). \quad (1)$$

The classification performance measured by Bayes risk is

$$B(\alpha,\delta) = \sum_{k=1}^{M} \sum_{j=1}^{M} C_{jk} \, \Pr(\delta(\alpha(X)) = k \text{ and } Y = j). \quad (2)$$

In order to simultaneously consider the compression and
classification abilities of the encoder, we use a Lagrangian
modified distortion expression which includes both ordinary
distortion and classification error:

$$J_\lambda(\alpha,\beta,\delta) = D(\alpha,\beta) + \lambda B(\alpha,\delta). \quad (3)$$
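
To make the Lagrangian in (3) concrete, the following sketch evaluates empirical versions of (1), (2) and (3) on a small labelled training set. It is an illustration under stated assumptions: the encoder is a plain Euclidean nearest-neighbor rule chosen for simplicity (it is not claimed to be the optimal encoder for (3)), and the names `nearest_codeword`, `lagrangian_cost`, `class_of_cell` and `cost` are hypothetical.

```python
def squared_error(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def nearest_codeword(x, codebook):
    """Assumed encoder: index of the codeword closest to x."""
    return min(range(len(codebook)), key=lambda i: squared_error(x, codebook[i]))

def lagrangian_cost(samples, labels, codebook, class_of_cell, cost, lam):
    """Empirical distortion D, Bayes risk B and J = D + lam * B.

    class_of_cell[i] is the class assigned to encoder output i;
    cost[j][k] is the cost of deciding class k when the true class is j.
    """
    distortion = 0.0
    bayes_risk = 0.0
    for x, y in zip(samples, labels):
        i = nearest_codeword(x, codebook)
        distortion += squared_error(x, codebook[i])
        bayes_risk += cost[y][class_of_cell[i]]
    n = len(samples)
    distortion /= n
    bayes_risk /= n
    return distortion, bayes_risk, distortion + lam * bayes_risk

if __name__ == "__main__":
    samples = [(0.0,), (0.2,), (1.0,), (0.8,)]
    labels = [0, 0, 1, 1]
    codebook = [(0.1,), (0.9,)]
    zero_one = [[0, 1], [1, 0]]
    print(lagrangian_cost(samples, labels, codebook, [0, 1], zero_one, 2.0))
```

Raising `lam` makes the classification term dominate J, while `lam = 0` reduces J to ordinary squared-error distortion.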

This formulation leads to necessary conditions for an
optimal code and a generalized Lloyd iterative code design
algorithm, which are surveyed with examples in this talk.
Abstract — The related problems of (finite-length)
robust prediction and maximizing spectral entropy
over a simplex of covariance matrices are considered.
General properties of iterative solutions of these
problems are developed, and monotone convergence proofs
are presented for two algorithms that provide such
solutions. The analogous problems for simplexes of
spectral densities are also considered.

I.  Summary 

The problem of designing an optimum predictor for an
observed time series is, of course, a fundamental one that arises
in innumerable applications. One reason for the central role
of this problem is that the design of such a predictor is
tantamount to the selection of a stochastic realization model of
the time series [6], which can in turn be used in tasks such as
control, data compression, and so forth.

The classical Levinson problem, that is, the finite-length
linear prediction of a covariance-stationary time series, is
a central problem within this general area. The
maximum-entropy spectrum fitting problem [8] is the counterpart of
the Levinson problem in the context of stochastic
realization. Both the Levinson problem and the
maximum-entropy spectrum fitting problem involve the computation of
a model/predictor for a time series from knowledge of the
covariance structure of the series up to some finite lag, say p.
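
For the exactly-known covariance case, the finite-length predictor can be computed with the standard Levinson-Durbin recursion. The sketch below is a textbook version, not the robust algorithms of the talk; the function name and the AR(1) example values are illustrative assumptions. It takes autocovariances r[0..p] and returns the order-p linear predictor coefficients and the prediction error variance.

```python
def levinson_durbin(r):
    """Solve the order-p linear prediction problem from autocovariances r[0..p].

    Returns (coeffs, err) where the predictor of x[t] is
    sum_k coeffs[k-1] * x[t-k] for k = 1..p, and err is the
    mean squared prediction error.
    """
    p = len(r) - 1
    a = [0.0] * (p + 1)          # a[0] is unused
    err = r[0]
    for m in range(1, p + 1):
        # Reflection coefficient for order m.
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        refl = acc / err
        new_a = a[:]
        new_a[m] = refl
        for k in range(1, m):
            new_a[k] = a[k] - refl * a[m - k]
        a = new_a
        err *= 1.0 - refl ** 2
    return a[1:], err

if __name__ == "__main__":
    # Autocovariances of an AR(1) process x[t] = 0.5 x[t-1] + w[t], var(w) = 1.
    r = [4.0 / 3.0, 2.0 / 3.0, 1.0 / 3.0]
    print(levinson_durbin(r))
```

On this AR(1) example the recursion should recover a first coefficient near 0.5, a second coefficient near zero, and an error variance near 1, since lags beyond the first carry no additional predictive information.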

When this covariance structure is not known exactly, but
rather is known only to lie in an uncertainty class of
covariances, then the classical Levinson formulation for predictor
design can be replaced by a minimax robustness formulation,
as developed in several works (see [9] for a review). In this
minimax formulation, the robust predictor is the optimum
predictor designed for a least-favorable covariance structure,
chosen to maximize the spectral entropy in the time series. In
the context of model determination, the counterpart to robust
prediction is approximate stochastic realization [5].

In this talk, we consider the minimax robust prediction
problem for the situation in which the uncertainty class of
covariances is a finite-dimensional simplex of covariance
matrices. As we shall note, this formulation arises naturally from
the consideration of confidence intervals for covariance
estimates. Moreover, solutions for such uncertainty classes can
be used as intermediate iterations for other convex uncertainty
classes, as will be discussed in the paper (see also [3]).

This talk is organized as follows (details of this work can
be found in [10]). First, we review briefly the problems
of finite-length minimax robust prediction and
maximum-entropy spectrum fitting, and in particular we note that both
problems have essentially the same solution. We also
provide necessary and sufficient conditions for solutions to these
problems over general uncertainty classes and over simplexes.
Next, general properties of iterative solutions to these
problems are presented. In particular, the convergence of a series
of entropies to the maximum entropy is shown to be equivalent
to convergence of the corresponding covariances and
predictors. Two iterative algorithms for maximizing entropy over
a simplex are then developed, together with proofs of their
monotone convergence. One of these algorithms generalizes
Nelson’s algorithm for solving minimax decision problems [7],
and the other is similar in approach to the Arimoto-Blahut
algorithm of information theory [1, 2]. Finally, we consider the
analogous problem for infinite-length prediction (i.e., $p = \infty$),
in which the covariance structure is specified in terms of an
(uncertain) spectral density. Results analogous to those for
the finite-length case are developed, and it is noted that this
infinite-length version of the problem is mathematically
identical to an optimization problem arising in portfolio theory [4].

^This work was supported by a UK SERC Senior Visiting Fel¬
Abstract — The main objective in universal modeling
is to construct a process for a class of model
processes which for long strings, generated by any of
the models in the class, behaves like the data
generating one. Hence, such a universal process may be
taken as a representation of the entire model class
to be used for statistical inference. If $f(x^n)$ denotes
the probability or density it assigns to the data string
$x^n = x_1, \ldots, x_n$, then the negative logarithm $-\log f(x^n)$,
which may be viewed as the shortest ideal code length
for the data obtainable with the model class, is called
the stochastic complexity of the string, relative to the
considered model class. Unlike in related universal
modeling, where the mean code length is sufficient,
we also need an explicit asymptotic formula for the
stochastic complexity. This is because it permits a
comparison of different model classes by their
stochastic complexity in accordance with the MDL
(Minimum Description Length) principle.

The MDL principle for model selection and statistical
inference in general is founded on the idea that
the strength of the constraints in the data, imposed by
the models, can be measured by the code length with
which the data can be encoded, when advantage is
taken of the constraints. This gives a data dependent
criterion, which for its justification does not require
the untenable assumption that the observed data are
generated by some distribution. Hence, instead of
minimizing a distance between the fitted models and
the nonexistent ‘true’ distribution we just search for
the model or model class that minimizes the code
length.

The main problem in the implementation of the
principle is how to estimate the shortest code length
for the data, given a suggested model class. This can
be difficult, requiring ingenuity and hard work if the
class of models is complex. Frequently a complex
model class is built up of simpler ones, each fitted
to a portion of the data, so that the total code length
can be composed of the stochastic complexities of the
components, and this again makes a formula for them
useful. The seminal case is the class consisting of just
one discrete distribution $P(x^n)$, for which the Shannon
information $-\log P(x^n)$ may be taken to represent the
shortest (ideal) code length among all prefix codes for
a data sequence of a fixed length in the sense of the
noiseless coding theorem; i.e., in the mean.

The most important classes of models for which we
can derive formulas for the stochastic complexity are
of the type $\mathcal{M}_k = \{f(x^n \mid \theta)\}$, or $\mathcal{M} = \bigcup_k \mathcal{M}_k$,
indexed by a parameter vector $\theta = \theta_1, \ldots, \theta_k$ and
satisfying the marginality condition for a random process.
If the model class $\mathcal{M}_k$ is such that the maximum
likelihood estimates satisfy the Central Limit Theorem
for densities for each $\theta$, an extension of the noiseless
coding theorem states in broad terms that no process
or, equivalently, code exists for which the mean
lengths with respect to $f(x^n \mid \theta)$ for the various values
of $\theta$ are shorter than the corresponding mean values
of the stochastic complexity

$$-\ln f(x^n \mid \hat\theta(x^n)) + \frac{k}{2} \ln \frac{n}{2\pi} + \ln \int \sqrt{|I(\theta)|} \, d\theta + o(1),$$

except for $\theta$ in a negligible subset. Here, $\hat\theta(x^n)$
denotes the maximum likelihood estimate, and $|I(\theta)|$ is
the Fisher information. Moreover, this ideal code
length up to terms of size $o(1)$ is given by the
negative logarithm of

$$\frac{f(x^n \mid \hat\theta(x^n))}{\int f(y^n \mid \hat\theta(y^n)) \, dy^n},$$

special cases of which were introduced and studied in
[1] and [2]. The code length just given has the
additional optimality properties for universal coding, [1],
[3], that strengthen its distinguished status. The
extension of the stochastic complexity formula to the
larger family $\mathcal{M} = \bigcup_k \mathcal{M}_k$ is straightforward, and for
a large subclass of finite alphabet Markov processes
an efficient recursive implementation of the associated
universal process is possible, [4]. However, the further
extension to the case where the data generating
processes are taken as nonparametric, residing in a
suitable closure of the class $\mathcal{M}$, poses difficulties. Some
progress towards evaluating the redundancy has been
made in [5] and [6]; for related results see also [7].
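
As a worked instance of the stochastic complexity formula, the sketch below takes the one-parameter Bernoulli class (an illustrative choice, not discussed in the abstract), for which k = 1 and the integral of the square root of the Fisher information over (0, 1) equals pi. It compares the asymptotic penalty term with the exact logarithm of the normalizing sum of maximized likelihoods. Function names are hypothetical; all code lengths are in nats.

```python
import math

def log_normalizer(n):
    """ln of sum_k C(n,k) (k/n)^k (1-k/n)^(n-k): exact normalizer, Bernoulli class."""
    total = 0.0
    for k in range(n + 1):
        log_term = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        if k > 0:
            log_term += k * math.log(k / n)
        if k < n:
            log_term += (n - k) * math.log((n - k) / n)
        total += math.exp(log_term)
    return math.log(total)

def asymptotic_penalty(n, k=1, fisher_integral=math.pi):
    """(k/2) ln(n / (2 pi)) + ln(integral of sqrt|I(theta)|)."""
    return 0.5 * k * math.log(n / (2 * math.pi)) + math.log(fisher_integral)

def stochastic_complexity(bits):
    """Asymptotic stochastic complexity of a binary string, Bernoulli class."""
    n, ones = len(bits), sum(bits)
    max_log_lik = 0.0
    if 0 < ones < n:
        p_hat = ones / n
        max_log_lik = ones * math.log(p_hat) + (n - ones) * math.log(1 - p_hat)
    return -max_log_lik + asymptotic_penalty(n)

if __name__ == "__main__":
    for n in (100, 1000):
        # The gap between the two columns should shrink as n grows.
        print(n, log_normalizer(n), asymptotic_penalty(n))
```

The shrinking gap between the exact and asymptotic terms is the o(1) remainder in the formula, seen here for one simple class only.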
Signal Expansions for Compression

Abstract — Signal expansions play a key role in practical
compression schemes, from audio/image/video
coding standards to current adaptive bases. Recent
developments, especially related to wavelet series
expansions, are reviewed, and current work on “best
bases” is discussed.

I.  Introduction 

Transform coding, together with predictive coding, is a key
technique used in many practical compression systems. It is
founded on the Karhunen-Loève transform, which
is the optimal transform under certain constraints [3]. In
practice, approximations like the discrete cosine transform (DCT)
are used, both for computational efficiency and because
they are signal independent but still efficient for many practical
signals. The transform coefficients are then quantized
(usually in a scalar fashion) and entropy coded. This three block
system [transform, quantization, entropy coding] raises some
interesting questions:

-  what  are  the  best,  possibly  adaptive,  transforms? 

-  what  is  the  interplay  of  the  three  components? 

-  can successive approximation or multiresolution be
efficiently achieved?

Recently, wavelets and their generalizations have appeared
as alternatives to the more classic Fourier and DCT
expansions [2]. In particular adapted expansions, and related
algorithms to find the best bases, are an interesting extension.

II.  Wavelet  Series  Expansions 

Classically, windowed Fourier transforms have been used to
obtain time-frequency representations of signals, and such
representations are useful for source coding as well. As an
alternative to local Fourier transforms, wavelet series have gained
popularity. In this case, a particular prototype “mother” wavelet
$\psi(t)$ is used to generate an orthonormal basis $\{\psi_{m,n}(t)\}$ for
$L^2(\mathbb{R})$ by shifts and scales

$$\psi_{m,n}(t) = 2^{-m/2} \, \psi(2^{-m} t - n), \quad m, n \in \mathbb{Z}.$$

A main difference between local Fourier expansions and
wavelet series is that they provide a different tiling of the
time-frequency plane. For example, at high frequencies or
small scales, the wavelet is very sharp in time and acts like a
mathematical microscope. In discrete-time, subband coding
and filter banks permit the computation of sampled
equivalents of local Fourier and wavelet transforms [7].
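
As a small discrete-time illustration of an orthonormal wavelet expansion, the sketch below implements the Haar transform through repeated two-channel averaging and differencing. The Haar filter is chosen here only because it is the simplest case; the talk does not single it out, and the function names are hypothetical. The input length is assumed to be a power of two.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar filter bank: (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def haar_transform(signal):
    """Full decomposition of a length-2^J signal, coarsest coefficients first."""
    coeffs = []
    approx = list(signal)
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        coeffs = detail + coeffs      # finer details stay at the end
    return approx + coeffs

def inverse_haar(coeffs):
    """Invert haar_transform by undoing one level at a time."""
    s = 1.0 / math.sqrt(2.0)
    approx = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        rebuilt = []
        for a, d in zip(approx, detail):
            rebuilt += [s * (a + d), s * (a - d)]
        approx = rebuilt
    return approx

if __name__ == "__main__":
    x = [4.0, 2.0, 5.0, 5.0, 1.0, 0.0, 3.0, 7.0]
    c = haar_transform(x)
    print(c)
    print(inverse_haar(c))
```

Because the transform is orthonormal, the coefficient energy equals the signal energy, which is what lets quantization error in the coefficient domain be read directly as reconstruction error.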

III.  Adaptive  Best  Bases 

Obviously, short-time Fourier transforms and wavelet series
are only two out of a myriad of possible useful tilings. In
particular, wavelet packets [1] and their time-varying
generalizations [4] provide signal adaptive orthonormal bases. When the
basis selection criterion is based on operational rate-distortion,
we effectively have an adaptive transform coding algorithm,
where quantization and entropy coding are included in the
cost function. Since such transforms are adapted on the fly,
computational efficiency is a must. In [4], a tree based
pruning algorithm is used, while in [8] a dynamic programming
procedure is applied.

^This work was supported by grant NSF MIP-93-21302.

IV.  Overcomplete  Representations 

While orthonormal bases have many desirable properties as
expansions for compression, a major drawback is their lack of
shift-invariance. Overcomplete representations or frames
overcome this problem, but the redundancy hurts compression. A
recent result from oversampled analog to digital conversion [6]
indicates that fine quantization in an orthonormal basis can
be traded for coarse quantization in an overcomplete
representation. Then, we discuss the use of matching pursuits [5]
for compression applications. In matching pursuit, a very
redundant dictionary is used together with a greedy algorithm
to find a best approximation to a given signal. Choices of
dictionaries and applications in video coding are considered.
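
The greedy step in matching pursuit can be sketched in a few lines. The version below is a generic illustration, assuming a finite dictionary of unit-norm atoms stored as Python lists; the dictionary, the signal and the function names are hypothetical and are not the dictionaries considered in the talk.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(signal, dictionary, n_iter):
    """Greedy expansion of `signal` over unit-norm atoms.

    Returns (expansion, residual), where expansion is a list of
    (atom_index, coefficient) pairs in the order they were selected.
    """
    residual = list(signal)
    expansion = []
    for _ in range(n_iter):
        # Select the atom most correlated with the current residual.
        scores = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda j: abs(scores[j]))
        coeff = scores[best]
        if coeff == 0.0:
            break
        expansion.append((best, coeff))
        residual = [r - coeff * a for r, a in zip(residual, dictionary[best])]
    return expansion, residual

if __name__ == "__main__":
    s = 1.0 / math.sqrt(2.0)
    atoms = [[1.0, 0.0], [0.0, 1.0], [s, s]]   # redundant dictionary in the plane
    print(matching_pursuit([3.0, 3.0], atoms, 3))
```

With unit-norm atoms each step removes the projection onto the chosen atom, so the residual energy never increases; the number of retained (index, coefficient) pairs then plays the role of the rate in a compression setting.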

V.  Conclusion 

A survey of signal expansions in the context of
transform-type coding was given, with an emphasis on wavelets and on
adaptive and overcomplete representations. Expansions that adapt to
the signals to be coded are a step towards universal transforms
for compression.
Abstract

Large deviations theory, a branch of probability theory
that deals with estimates of probabilities of very rare
events, has close links with topics in information theory
and in statistics. We shall explore some of these
connections.

A sequence $\{\mu_n\}$ of Borel probability measures on $\mathcal{X}$
satisfies the large deviations principle (LDP) with a rate function
$I : \mathcal{X} \to [0, \infty]$ if

UBD:

$$\limsup_{n\to\infty} \mu_n(F)^{1/n} \le \exp\Bigl(-\inf_{x\in F} I(x)\Bigr) \quad \forall F \text{ closed}$$

LBD:

$$\liminf_{n\to\infty} \mu_n(G)^{1/n} \ge \exp\Bigl(-\inf_{x\in G} I(x)\Bigr) \quad \forall G \text{ open}$$

and $I^{-1}[0, a]$ are compact sets for every $a < \infty$. A sequence of
r.v. is said to satisfy the LDP if their laws satisfy the LDP. The
similarity of the LDP and the definition of weak convergence
of probability measures is apparent.

Indeed, the theory of large deviations is fast approaching the
state of maturity of weak convergence (for example, the texts
[1,2,3] are dedicated to the former). However, from an
application point of view these two theories serve complementary
purposes. While weak convergence sheds light on the center of
the distributions $\mu_n$ (i.e., events $A$ for which $\mu_n(A)$ is bounded
away from zero), large deviations theory deals primarily with
the tails of $\mu_n$.

Perhaps the best-known LDP is Sanov’s theorem, [1, Sec.
6.2], stating that for i.i.d. $\Sigma$-valued random variables
$X_1, \ldots, X_n$, each distributed according to $\mu$, their empirical
measures $L_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$ satisfy in $M_1(\Sigma)$
(= space of Borel probability measures on $\Sigma$) the LDP with rate
function $H(\cdot \mid \mu)$. Here $H(\cdot \mid \cdot)$ stands for the relative
entropy (also known as the KL divergence or cross entropy):

$$H(\nu \mid \mu) = \begin{cases} \int f \log f \, d\mu & \text{if } f = \frac{d\nu}{d\mu} \text{ exists} \\ \infty & \text{otherwise.} \end{cases}$$

Sanov’s theorem provides a clue to some of the links explored
in this talk, namely:

(a) The method of types, introduced in Information Theory
(cf. [4,5]), allows one to prove Sanov’s theorem when $\Sigma$
is a finite set and goes much beyond this simple setup.

(b) The decisive role of the relative entropy in Sanov’s
theorem is exemplified in the use of information inequalities
to prove statements about conditional laws (cf. [6]).
The notions of sufficient statistics and of universal prior
are closely related to these conditional laws.

(c) Sanov’s theorem yields the asymptotics of probability
of error in the Hypothesis Testing problem (cf. [1, Sec.
3.4]).

(d) The map $\nu \mapsto \int x \, d\nu : M_1(\Sigma) \to \Sigma$ contracts Sanov’s
theorem to Cramér’s theorem dealing with the LDP for
the empirical means $S_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. The corresponding
result for weakly dependent $X_i$ (Gärtner-Ellis theorem)
provides a large-deviations-based proof of Shannon’s
(noisy) source coding theorem (cf. [1, Sec. 3.6]).

Time permitting, other relations to be discussed are the
use and value of large deviations in non-parametric and/or
sequential statistics problems, in certain (practical)
communication theory problems (cf. [7]), and in the study of fractal
measures and sets.
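
The role of relative entropy as a rate function can be checked numerically in the simplest case. The sketch below assumes i.i.d. Bernoulli(p) variables, an illustrative choice with hypothetical function names, and compares the exact tail exponent -(1/n) ln P(S_n >= a) with the relative entropy of Bernoulli(a) with respect to Bernoulli(p), which is the Cramér rate at a for a > p.

```python
import math

def bernoulli_rate(a, p):
    """Relative entropy of Bernoulli(a) with respect to Bernoulli(p), in nats."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def log_binomial_pmf(n, k, p):
    """ln P(exactly k successes in n Bernoulli(p) trials)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def tail_exponent(n, k_min, p):
    """-(1/n) ln P(at least k_min successes in n trials), via log-sum-exp."""
    logs = [log_binomial_pmf(n, k, p) for k in range(k_min, n + 1)]
    top = max(logs)
    log_tail = top + math.log(sum(math.exp(v - top) for v in logs))
    return -log_tail / n

if __name__ == "__main__":
    p = 0.5
    for n in (200, 2000):
        k_min = (6 * n) // 10          # event: empirical mean at least 0.6
        print(n, tail_exponent(n, k_min, p), bernoulli_rate(0.6, p))
```

The exact exponent stays above the rate (this is the Chernoff bound) and approaches it as n grows, the gap being of order (ln n)/n.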
Abstract — We consider a computational strategy for shape
recognition based on choosing “tests” one at a time
in order to remove as much uncertainty as possible
about the true hypothesis. The approach is compared
with other recognition paradigms in computer vision
and illustrated by attempting to classify handwritten
digits and track roads from satellite images.

I.  Overview 

We explore the possibility of recognizing shapes “simply”
by asking the right questions in the right order. In the abstract
formulation, we are given a finite list of possible “hypotheses”;
exactly one of these is true and we wish to decide which it is
based on the results of various “tests” or “questions.” There
is a decision tree which instructs us how to perform the tests
and eventually classify the results. Each interior node of the
tree is assigned one of the tests and each terminal node is
assigned one of the hypotheses. The assignment of tests, the
“strategy,” is adaptive in the sense that the choice of the test
at each node may depend on the test values observed at all
preceding nodes along the same branch. Ideally, the choice
would be driven by some global measure of efficiency, such
as achieving the most accurate classifier for a given average
number of tests, or reaching the fastest decision at a given
level of accuracy. But these problems are intractable, and
we shall opt instead for the “greedy” strategy in which the
tests are chosen recursively based on minimizing the expected
entropy of the updated distribution over hypotheses given the
test results.
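
One step of the greedy strategy can be sketched as follows. The sketch makes a simplifying assumption that is not in the talk: each test is deterministic and noise-free, so a test is just a table giving its outcome under each hypothesis. The function names and the toy tests are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def expected_posterior_entropy(prior, test):
    """Expected entropy over hypotheses after observing one test.

    test[h] is the outcome the test returns when hypothesis h is true
    (assumed deterministic for this illustration).
    """
    by_outcome = {}
    for h, p in enumerate(prior):
        by_outcome.setdefault(test[h], []).append(p)
    total = 0.0
    for masses in by_outcome.values():
        weight = sum(masses)
        if weight > 0:
            total += weight * entropy([m / weight for m in masses])
    return total

def greedy_test(prior, tests):
    """Index of the test minimizing the expected posterior entropy."""
    return min(range(len(tests)),
               key=lambda t: expected_posterior_entropy(prior, tests[t]))

if __name__ == "__main__":
    prior = [0.25, 0.25, 0.25, 0.25]
    tests = [[0, 0, 0, 1],    # isolates one hypothesis
             [0, 0, 1, 1],    # splits the hypotheses evenly
             [0, 0, 0, 0]]    # uninformative
    print(greedy_test(prior, tests))
```

After a test is performed, the prior would be replaced by the posterior given the observed outcome and the selection repeated, which is the recursive step described above.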

We have applied this to two problems in shape recognition,
focusing on linear, deformable structures. The raw data is a
binary or grey-level image, the tests are particular “features”
(i.e., image functionals), and the hypotheses refer to particular
shape classes or spatial positionings.

II.  Roads 

This application is joint work with Bruno Jedynak of
INRIA-Rocquencourt. We describe a new algorithm [2] for
tracking major roads from panchromatic SPOT satellite
imagery, demonstrated on SPOT images of size 6000 x 8000,
representing a 60 km x 80 km region on the ground, in this
case in southern France.

The standard construction of decision trees (e.g., in coding,
CART [1], and machine learning) is off-line, nonparametric,
and based on “training data.” However, in our formulation of
tracking, it is impossible to pre-compute and store the entire
decision tree: it has too many branches from each (interior)
node, it is too deep (i.e., too many tests are needed to reach a
decision), and the number of possible road locations is
enormous. (The tests are local matched filters indexed by image
location and designed to respond to short road segments.)
Instead, the entropy strategy is implemented on-line: each new
filter is chosen during the actual tracking based on the
particular filter results previously encountered; in other words,
we only compute the branch of the tree that is needed for the
data at hand. In fact, the amount of time necessary to
perform the tests is small compared to determining the “right”
test to perform. On the other hand, compared to maximum
likelihood estimation, the number of tests actually performed
until a decision is made is exponentially small; indeed,
maximum likelihood is computationally impossible.

^This work was supported by NSF Grant DMS-9217655 and
ONR Contract N00014-91-J-1021.

Our approach is also model-based rather than nonparametric.
As a result, we can formulate the problem of minimizing
entropy in explicit and relatively simple analytical terms. To
execute the strategy we then alternate between data
collection and optimization: at each iteration, new image data is
examined and a new entropy minimization problem is solved
(exactly) resulting in a new image location to inspect, and so
forth. This will be illustrated with a video.

III.  Digits 

We shall also briefly mention another application, the
recognition of handwritten numerals, which is joint work
with Prof. Yali Amit of the University of Chicago. There
are ten hypotheses and the strategy is again constructed by
stepwise entropy reduction, but off-line and not in the
standard Euclidean framework. Instead, we construct relational
classification trees based on accumulating information about
a graphical representation of the image data involving planar
arrangements among local landmarks. Actually, we construct
many such trees, each being a distribution-valued test. For any
given training set, there is then a fundamental trade-off between
the tree depth and tree generality; this is related to well-known
issues and tradeoffs in computational learning and computer
vision. Finally, the classification rates obtained appear to be
comparable to state-of-the-art neural networks and other
nonparametric statistical classifiers.
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Originally  coming  from  physics,  maximum  en¬ 
tropy  (ME)  has  been  promoted  to  a  general  principle 
of  inference  primarily  by  the  works  of  Jaynes  [4]. 

ME applies to the problem of inferring a probability
mass (or density) function, or any non-negative
function $p(x)$, when the available information specifies
a set $E$ of feasible functions, and there is a prior
guess $q$. The ME solution is that $p^* \in E$ which
minimizes the information divergence

$D(p \,\|\, q) = \sum_x \left[ p(x) \log \frac{p(x)}{q(x)} - p(x) + q(x) \right].$  (1)

For probability mass functions, if $q$ is uniform,
minimizing (1) is the same as maximizing the entropy
$H(p)$. This is why the method is called ME also in
general.

In typical applications, the available information
consists of linear constraints on $p$, i.e.,

$E = \{ p : \sum_x p(x) a_i(x) = b_i,\ i = 0, 1, \ldots, k \}.$  (2)

Then the ME solution $p^*$ (uniquely) exists, and

$p^*(x) = q(x) \exp \sum_{i=0}^{k} \theta_i a_i(x),$  (3)

provided $q$ is strictly positive and $E$ contains some
strictly positive $p$. In the non-discrete case (with
sums replaced by integrals), the existence of an ME
solution cannot be asserted in the above generality,
although a unique $p^*$ always exists, possibly not in
$E$, such that $D(p_n \,\|\, p^*) \to 0$ for every sequence
$\{p_n\} \subset E$ with $D(p_n \,\|\, q) \to \inf_{p \in E} D(p \,\|\, q)$;
this $p^*$ is of form (3), cf. [1].
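
The exponential form (3) can be illustrated numerically. The sketch
below assumes a finite alphabet, a normalization constraint together
with a single mean constraint (so that $a_1(x) = x$), and solves for
the one free coefficient by bisection; the function name and the
dice example are illustrative assumptions, not part of the talk.

```python
import math

def maxent_mean(values, target_mean, q=None, tol=1e-12):
    """ME distribution on `values` with prior guess q, subject to
    sum_x p(x) = 1 and sum_x p(x) * x = target_mean.

    The solution has the form p(x) = q(x) * exp(theta0 + theta1 * x);
    theta1 is found by bisection (the mean is increasing in theta1)
    and theta0 is absorbed in the normalization. The target mean is
    assumed to lie strictly between min(values) and max(values).
    """
    if q is None:
        q = [1.0 / len(values)] * len(values)

    def tilted(theta):
        w = [qi * math.exp(theta * v) for qi, v in zip(q, values)]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(theta):
        return sum(p * v for p, v in zip(tilted(theta), values))

    lo, hi = -50.0, 50.0          # assumed bracket for theta1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))

# Hypothetical example: a die whose average is constrained to 4.5.
faces = [1, 2, 3, 4, 5, 6]
p_star = maxent_mean(faces, 4.5)
```

With a uniform prior guess the successive ratios p*(x+1)/p*(x) are
constant, which is exactly the exponential structure in (3).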

^This work was supported by OTKA Grant No. 1906

We will review the arguments that have been put
forward for justifying ME. In this author’s opinion,
the strongest theoretical support for ME is provided
by the axiomatic approach. This shows that, in some
sense, ME is the only logically consistent method
of inferring a function subject to linear constraints.
This approach also leads to alternatives that come
into account under weaker axioms, cf. [2]. Such are
the methods of minimizing an $f$-divergence

$D_f(p \,\|\, q) = \sum_x q(x) f\!\left( \frac{p(x)}{q(x)} \right)$  (4)

or  a  Bregman  divergence 

$B_f(p \,\|\, q) = \sum_x \left[ f(p(x)) - f(q(x)) - f'(q(x)) (p(x) - q(x)) \right],$  (5)

where $f$ is a strictly convex function. Minimizing
(5) leads to scale invariant inference if $f = f_\alpha$, where
$f_1(t) = t \log t - t$, $f_0(t) = -\log t$, $f_\alpha(t) = t^\alpha$ if
$\alpha > 1$ or $\alpha < 0$, $f_\alpha(t) = -t^\alpha$ if $0 < \alpha < 1$.
Inference by minimizing (5), particularly with $f = f_\alpha$, has been
suggested also in [5], based on another axiomatic
approach.

For the problem of attainment of the minimum
of (4) or (5) subject to $p \in E$ (in the non-discrete
case), there is an analogue of the result in the
passage containing (3). Whether or not this permits
giving simple sufficient conditions for the minimum
to be attained depends on the behavior of $f$ at
infinity, cf. [3].
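
The relation between (1), (4) and (5) can be checked numerically.
The sketch below assumes strictly positive functions on a finite set
and uses $f(t) = t \log t - t + 1$ in (4) and $f_1(t) = t \log t - t$
in (5); with these choices both reduce to the information divergence
(1). The function names and the test vectors are illustrative.

```python
import math

def i_divergence(p, q):
    """Information divergence (1) for positive, not necessarily
    normalized, functions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))

def f_divergence(p, q, f):
    """The f-divergence (4): sum_x q(x) f(p(x) / q(x))."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def bregman_divergence(p, q, f, f_prime):
    """The Bregman divergence (5) with derivative f_prime of f."""
    return sum(f(pi) - f(qi) - f_prime(qi) * (pi - qi)
               for pi, qi in zip(p, q))

def f_kl(t):
    """Assumed normalization of t log t making f(1) = 0."""
    return t * math.log(t) - t + 1.0

def f_one(t):
    """f_1(t) = t log t - t, as in the text."""
    return t * math.log(t) - t

# Hypothetical non-negative functions (they need not sum to one).
p = [0.2, 1.5, 0.7]
q = [0.5, 1.0, 1.2]
d1 = i_divergence(p, q)
d4 = f_divergence(p, q, f_kl)
d5 = bregman_divergence(p, q, f_one, math.log)
```

All three values agree up to rounding, which is one way to see that
(1) belongs to both families.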
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Abstract  —  The  capacity  of  the  channel  induced  by 
a  given  class  of  sources  is  well  known  to  be  an  at¬ 
tainable  lower  bound  on  the  redundancy  of  universal 
codes  w.r.t  this  class,  both  in  the  minimax  sense  and 
in  the  Bayesian  (maximin)  sense.  We  show  that  this 
capacity  is  essentially  a  lower  bound  also  in  a  stronger 
sense,  that  is,  for  “most”  sources  in  the  class.  This 
result extends Rissanen’s lower bound for parametric
families.  We  demonstrate  its  applicability  in  several 
examples  and  discuss  its  implications  in  statistical  in¬ 
ference. 

In universal coding w.r.t. a given class of sources the
objective is to design a single code that “performs well” for every
source in the class. The sources in the class are indexed by a
variable $\theta \in \Lambda$. The performance of a given code $L$ is judged
on the basis of the redundancy, which is defined as the
difference between the expected code length of $L$ w.r.t. a given
source $P_\theta$ and the $n$th order entropy of $P_\theta$, normalized by the
length $n$ of the input vector.

Two important notions of universality are the maximin
universality and the minimax universality [1]. Gallager [2] was
the first to show that the minimax redundancy and the
maximin redundancy are equivalent and that they are both equal
to the capacity of the “channel” whose input is $\theta$ and whose
output is the random source vector $X^n = (X_1, \ldots, X_n)$. In
particular, for parametric families where $\theta$ is a $k$-dimensional
vector, the minimax redundancy, and hence also the maximin
redundancy and the capacity of the corresponding channel,
was shown to be essentially $0.5 k \log n / n$.

Rissanen [3] has strengthened the notion of universality
w.r.t. parametric families by showing that $0.5 k \log n / n$ is not
only an achievable lower bound in the minimax sense, but
also a lower bound for “most” sources in the class. Here by
“most” sources we mean every point $\theta$ except for a subset of
points whose Lebesgue measure vanishes as $n$ grows.
Rissanen’s proof, however, relies heavily on the structure of the
parametric family, and essentially the main insight that can
be gained from his work is that the redundancy is strongly
related to the richness of the class, which in the parametric case
is proportional to the dimension $k$ of the parameter vector.

It turns out that Rissanen’s stronger notion of universality
extends to the general case where the class of sources is
not necessarily a parametric family. Specifically, we show that
the Shannon capacity of the induced channel is a lower bound
on the redundancy that holds simultaneously for all sources
in the class except for a subset of points whose probability,
under the capacity-achieving probability measure, vanishes
as $n$ tends to infinity. This means that the minimax
redundancy and the lower bound essentially coincide for most
choices of $\theta$. Moreover, if the capacity-achieving probability
density happens to be positive almost everywhere (Lebesgue),
as is normally the case in parametric families, the above result
holds also for most sources in the Lebesgue measure sense, and
therefore Rissanen’s result is obtained as a special case.

The  proof  is  completely  different  and  considerably  simpler 
than  Rissanen’s  proof  [3].  However,  it  does  not  allow  a  free 
choice  of  any  prior,  other  than  the  capacity-achieving  prior, 
that  might  be  reasonable  as  well  for  weighting  the  set  of  points 
that  violate  the  bound.  We  next  provide  another  variant  of  our 
result  which  permits  any  prior  on  the  index  set,  but  then  the 
random  coding  capacity  of  the  induced  channel  rather  than  its 
Shannon capacity is obtained as a lower bound. Here the
random coding capacity refers to the normalized logarithm of the
maximum number $M$ of randomly chosen points $\theta_1, \ldots, \theta_M$
which form, with high probability, a set of distinguishable
sources $P_{\theta_1}, \ldots, P_{\theta_M}$. For most cases of practical interest the
Shannon  capacity  and  the  random  coding  capacity  are  equiv¬ 
alent  and  hence  the  resulting  bound  is  virtually  as  tight.  We 
believe  that  another  advantage  of  this  random  coding  capacity 
result  is  that  it  may  add  some  new  insight  about  the  relation 
between  redundancy  and  capacity.  Specifically,  in  the  proof 
of  this  result  the  redundancy  is  linked  directly,  not  only  to 
the  mathematical  notion  of  capacity  as  the  maximum  mutual 
information,  but  also  to  the  operational  notion,  i.e.,  the  max¬ 
imum  achievable  rate  of  reliable  communication. 

The results above have a broader significance in statistical
inference. In the absence of knowledge about the true
underlying class member $P_\theta$, the statistician wishes to construct a
single universal probability measure $Q$ that “explains well”
the data. His task is successful if $Q$ is simultaneously “close”
to every source in the class, where distance is measured in
terms of the divergence $D(P_\theta \,\|\, Q)$. In this context, our main
result is the following attainable lower bound: For all $\theta \in \Lambda$,
except for points in a subset $B \subset \Lambda$ whose probability, under
the capacity-achieving prior, vanishes as $n \to \infty$,

$D(P_\theta \,\|\, Q) = E_\theta \log \frac{P_\theta(X^n)}{Q(X^n)} \ge (1 - \epsilon) C_n,$

where $C_n$ is the capacity of the channel from $\theta$ to $X^n$. Thus,
a necessary and sufficient condition for the statistician’s task
to succeed asymptotically is that $C_n / n \to 0$. Note that,
unlike the lossless data compression problem, this setting applies
to the continuous alphabet case as well. This point of view
provides a general framework for the choice of a statistical
model in the presence of uncertainty, which can be used for
other decision making problems as well, like universal
gambling, portfolio selection, and prediction.
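
For a finite class, the capacity $C_n$ of the channel from the index
to the data can be approximated numerically. The sketch below is not
taken from the paper: it assumes a class of Bernoulli sources on a
hypothetical grid of parameter values, uses the number of ones as a
sufficient statistic for $X^n$, and applies the standard
Blahut-Arimoto iteration to search for the capacity-achieving prior.

```python
import math

def binomial_channel(thetas, n):
    """W[j][k] = probability of k ones in n trials under theta_j."""
    return [[math.comb(n, k) * t ** k * (1 - t) ** (n - k)
             for k in range(n + 1)] for t in thetas]

def divergences(w, W):
    """D(W_j || q) in nats for each input j, where q is the output
    distribution induced by the prior w."""
    n_out = len(W[0])
    q = [sum(w[j] * W[j][k] for j in range(len(w))) for k in range(n_out)]
    return [sum(W[j][k] * math.log(W[j][k] / q[k])
                for k in range(n_out) if W[j][k] > 0)
            for j in range(len(w))]

def mutual_information(w, W):
    """I(theta; X^n) under prior w, i.e. the Bayes redundancy."""
    return sum(wj * dj for wj, dj in zip(w, divergences(w, W)))

def blahut_arimoto(W, iters=500):
    """Iteratively reweight the prior towards the capacity-achieving one."""
    w = [1.0 / len(W)] * len(W)
    for _ in range(iters):
        d = divergences(w, W)
        unnorm = [wj * math.exp(dj) for wj, dj in zip(w, d)]
        z = sum(unnorm)
        w = [u / z for u in unnorm]
    return w

# Hypothetical class: 21 Bernoulli sources, block length n = 20.
n = 20
thetas = [(j + 0.5) / 21 for j in range(21)]
W = binomial_channel(thetas, n)
uniform = [1.0 / 21] * 21
w_star = blahut_arimoto(W)
i_uniform = mutual_information(uniform, W)
i_star = mutual_information(w_star, W)
```

The value `i_star` lower-bounds the capacity and
`max(divergences(w_star, W))` upper-bounds it, so the gap between the
two indicates how close the iteration is to $C_n$.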
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Abstract  —  An  important  class  of  universal  encoders 
is  the  one  where  the  encoder  is  fed  by  two  inputs:  a) 
The  incoming  string  of  data  to  be  compressed,  b) 
An  N-bit  description  of  the  source  statistics  (i.e.  a 
“training sequence”). We consider Fixed-to-Variable
universal encoders that noiselessly compress blocks of
length $\ell$.

Two  problems  will  be  addressed: 

1. The Minimum Training-sequence length, $N_{\min}(\ell)$:
Given a class of admissible stationary sources, find the
minimal length of a training sequence needed in order to
guarantee that any source in the given class, with an $\ell$-th
order entropy $H_\ell < \log A$, will yield some compression
($A$ is the alphabet size).

2. An Optimal Universal Encoder (UE):

Find a UE that “ensures” that the compression for EVERY
source in the given class is close to the minimal
possible compression $H_\ell$, once the training sequence is
longer than $N_{\min}(\ell)$.

The  first  case  to  be  considered  is  the  one  where  the  train¬ 
ing  sequence  and  the  incoming  data  string  are  assumed  to  be 
statistically  independent. 

A Converse Theorem (solving problem (1)) and a Coding
Theorem (solving problem (2)) are given for the class
of finite-alphabet stationary sources with a vanishing
memory (i.e., sources that satisfy a certain mixing condition [1],
[3]). This class includes all finite-order Markov sources.

Another, perhaps more practical, case is the one where the
training sequence consists of the last $N$ bits of the data that
has been processed (i.e., a “sliding window” algorithm).

For any stationary source $P$ over an alphabet of $A$ letters,
let $B_N = [X_{-j}^{0} : j = \max[i : P(X_{-i}^{0}) > 1/N;\ i = -1, 0, 1, 2, \ldots]]$
and define the conditional entropy $H^N(X_1 \mid X_{-j}^{0})$, which
is monotonically decreasing with $N$ and satisfies
$H \le H^N(X_1 \mid X_{-j}^{0}) \le H_1$. It is demonstrated that (for large $N$)
the length of the training sequence must be bigger than $N$, or
else, for any universal FV encoder for $\ell$-vectors there exists at
least one stationary source with $H^N(X_1 \mid X_{-j}^{0}) < R < \log A$ for
which the compression is $\log A - \epsilon$. Here $\epsilon > 0$ and
$0 < R < \log A$ are arbitrary and $\ell$, the length of the source
words, must be of order between $\log \log N$ and $\log N$.
Conversely, we describe a compression scheme that yields a
compression that is arbitrarily close to $H^N(X_1 \mid X_{-j}^{0})$ for every
stationary source, provided that the length of the sequence is
larger than $N$ (i.e. the length is at least [...], where $\epsilon$ is
arbitrarily small).

The coding theorems are based on variants of the
Lempel-Ziv data-compression algorithm [2].
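
To make the role of the training sequence concrete, the following
sketch parses incoming data into longest matches found in a fixed,
independent training sequence. It is a generic Lempel-Ziv style
illustration and not the encoder analyzed in the talk; the token
format, the minimum match length of 2 and the function names are
assumptions.

```python
def longest_match(training, data, start):
    """Longest prefix of data[start:] that occurs inside `training`.
    Returns (offset into training, match length)."""
    best_off, best_len = 0, 0
    for off in range(len(training)):
        length = 0
        while (start + length < len(data)
               and off + length < len(training)
               and training[off + length] == data[start + length]):
            length += 1
        if length > best_len:
            best_off, best_len = off, length
    return best_off, best_len

def encode(training, data):
    """Parse data into copy tokens (pointers into the training
    sequence) and literal tokens where no useful match exists."""
    tokens, i = [], 0
    while i < len(data):
        off, length = longest_match(training, data, i)
        if length >= 2:
            tokens.append(("copy", off, length))
            i += length
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def decode(training, tokens):
    """Invert `encode` given the same training sequence."""
    out = []
    for tok in tokens:
        if tok[0] == "copy":
            _, off, length = tok
            out.append(training[off:off + length])
        else:
            out.append(tok[1])
    return "".join(out)

# Hypothetical strings.
training = "abracadabra"
data = "cadabraz"
tokens = encode(training, data)
```

A longer training sequence makes long matches more likely, which is
the intuition behind asking how large $N$ must be before any
compression is possible.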
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Abstract  —  We  determine  the  asymptotic  minimax 
redundancy  of  universal  data  compression  in  a  para¬ 
metric  setting  and  show  that  it  corresponds  to  the  use 
of  Jeffreys  prior.  Statistically,  this  formulation  of  the 
coding  problem  can  be  interpreted  in  a  prior  selection 
context  and  in  an  estimation  context. 


I.  Introduction 

Here  we  exploit  a  relationship  between  coding  in  informa¬ 
tion  theory  and  risk  in  statistics.  In  source  coding  one  often 
wants  to  minimize  the  redundancy  of  the  code  and  in  channel 
coding  one  often  wants  to  achieve  a  high  rate  of  transmis¬ 
sion.  These  two  goals  can  be  defined  in  terms  of  the  relative 
entropy  between  two  distributions.  The  relative  entropy  be¬ 
tween  distributions  can  also  be  used  as  a  loss  function  in  a  de¬ 
cision  theory  context.  Since,  roughly  speaking,  a  source  code 
corresponds  to  an  estimator  for  an  unknown  distribution  the 
statistical  implication  is  that  one  can  seek  Bayes  estimators 
and  a  least  favorable  prior  which  has  desired  noninformativity 
properties. 


II.  Main  results 

If one has data from a source distribution with density given
by $p_\theta(x^n) = \prod_{i=1}^{n} p_\theta(x_i)$, where $p_\theta$ is a member of a smooth
parametric family, and one has a continuous density $w$ on the
parameter space, then the Bayes code achieves the minimum
of the Bayes redundancies $\int w(\theta) D(p_\theta \,\|\, q_n)\, d\theta$ over all
distributions $q_n$, where $D$ is the relative entropy. The Bayes
redundancy for a code is the same as the Bayes risk for the estimator
corresponding to that code. Thus, the Bayes code is based on
the mixture density $m_n(x^n) = \int w(\theta) p_\theta(x^n)\, d\theta$. Equivalently,
$m_n$ can be regarded as a Bayes estimator, with risk $D(p_\theta \,\|\, m_n)$,
and Bayes risk $\int w(\theta) D(p_\theta \,\|\, m_n)\, d\theta$. Maximizing the Bayes
redundancy over choices of $w$ gives the maximin redundancy, or
maximin risk. One can therefore identify a maximin estimator
or a maximin code, formed from the mixture distribution
with respect to the choice of $w$ achieving the maximal Bayes
redundancy.

Alternatively,  one  might  seek  a  code  or  estimator  which 
minimizes  the  worst  case  redundancy.  Game  theory  suggests 
that  such  a  minimax  procedure  will  be  the  same  as  the  max¬ 
imin  procedure.  This  turns  out  to  be  the  case  and  one  can 
identify  a  least  favorable  prior  w. 

Our main results, see [3], are as follows. First, the
redundancy of the Bayes code, i.e., the risk of the Bayes estimator
$m_n$, is

$R_n(\theta, w) = \frac{d}{2} \log \frac{n}{2 \pi e} + \log \frac{\sqrt{|I(\theta)|}}{w(\theta)} + o(1),$

uniformly for $\theta$ in compact sets $K$ in the interior of the
parameter space, where $d$ is the dimension of the parameter and
$I(\theta)$ is the Fisher information matrix.
Second, the Bayes redundancy of the Bayes code, i.e., the
Bayes risk of the Bayes estimator, is

$R_n(w) = \frac{d}{2} \log \frac{n}{2 \pi e} + \int w(\theta) \log \frac{\sqrt{|I(\theta)|}}{w(\theta)}\, d\theta + o(1).$

Third, the asymptotic minimax redundancy, i.e., the minimax
risk, is

$R_n = \frac{d}{2} \log \frac{n}{2 \pi e} + \log \int \sqrt{|I(\theta)|}\, d\theta + o(1) = R_n^*,$

where $R_n^*$ is the maximin redundancy, or the maximin risk.
Finally, the least favorable prior is seen to be Jeffreys prior,
given by $|I(\theta)|^{1/2} / \int |I(\theta)|^{1/2}\, d\theta$, see [5]. With this choice the
redundancy, or risk, achieves asymptotically the same value,
$R_n(\theta, w) = R_n + o(1)$, uniformly for $\theta$ in compacta interior to $K$.

The associated codelength takes the form

$\log \frac{1}{m_n(x^n)} = \log \frac{1}{p_{\hat\theta}(x^n)} + \frac{d}{2} \log \frac{n}{2\pi} + \log \int \sqrt{|I(\theta)|}\, d\theta + o(1),$

where $\hat\theta$ is the maximum likelihood estimator and $d$ is the
dimension of the parameter, see [4]. This
mixture codelength has been suggested for use in a suitable
formulation of the minimum description length principle for
model selection, see [1], [6].
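
The codelength expansion can be checked in the simplest case. The
sketch below assumes the Bernoulli family, for which the Fisher
information is $1/(\theta(1-\theta))$, the integral of its square
root over (0, 1) equals $\pi$, and Jeffreys prior is the
Beta(1/2, 1/2) density, so that the mixture has a closed form in
terms of Gamma functions. The function names and the sample counts
are illustrative.

```python
import math

def jeffreys_mixture_codelength(k, n):
    """Exact -log m(x^n) in nats for a binary string with k ones,
    where m is the Beta(1/2, 1/2) mixture of Bernoulli sources."""
    log_m = (math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5)
             - math.lgamma(n + 1) - math.log(math.pi))
    return -log_m

def ml_codelength(k, n):
    """-log of the maximized likelihood; requires 0 < k < n."""
    th = k / n
    return -(k * math.log(th) + (n - k) * math.log(1 - th))

def asymptotic_codelength(k, n):
    """Expansion from the text with d = 1 and integral equal to pi,
    dropping the o(1) term."""
    return (ml_codelength(k, n)
            + 0.5 * math.log(n / (2 * math.pi))
            + math.log(math.pi))

# Hypothetical data summary: 300 ones in a string of length 1000.
exact = jeffreys_mixture_codelength(300, 1000)
approx = asymptotic_codelength(300, 1000)
```

For these counts the two values differ by far less than one hundredth
of a nat, in line with the o(1) remainder when the maximum likelihood
estimate lies in the interior of the parameter space.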

III.  Implications 

This least favorable prior has an interpretation in channel
coding. The Bayes risk $R_n(w)$ is the Shannon mutual
information $I(\Theta; X^n)$. This corresponds to the channel in
which $\theta$ is sent to $n$ receivers who decode the message $x^n$
together. Asymptotically, the source achieving the channel
capacity, $\max_w I(\Theta; X^n)$, is Jeffreys’ prior.

In statistics, the distribution achieving the maximal mutual
information is called a reference prior. The results for
continuous parameters provide formal verification of a conjecture
in [2], that in the absence of nuisance parameters, reference
priors are Jeffreys priors. Equivalently, one can write the
mutual information as $E_{m_n} D(w(\cdot \mid X^n) \,\|\, w(\cdot))$. The $w$ maximizing
this quantity asymptotically gives a posterior as far as
possible from the prior on average. That is, it is the prior which
leaves the most to be learned from the data and so represents
minimal informativity.
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Abstract  —  This  paper  focuses  on  lower  bound  re¬ 
sults  on  expected  redundancy  for  universal  compres¬ 
sion  of  iid  data  from  parametric  and  nonparametric 
families.  Two  types  of  lower  bounds  are  reviewed. 
One  is  Rissanen’s  almost  pointwise  lower  bound  and 
its  extension  to  the  nonparametric  case.  The  other  is 
minimax  lower  bounds,  for  which  a  new  proof  is  given 
in  the  nonparametric  case. 

I.  Introduction 

One  important  ingredient  of  Rissanen’s  Stochastic  Com¬ 
plexity  theory  is  his  (almost)  pointwise  lower  bound  on  ex¬ 
pected  redundancy  for  regular  parametric  models  (cf.  [4]), 
and  a  minimax  counterpart  follows  from  [2].  By  expressing 
expected  redundancy  in  terms  of  accumulated  expected  pre¬ 
diction  errors,  a  similar  lower  bound  was  proved  in  [5]  and  [7] 
on expected redundancy for a smooth nonparametric class of
densities.  This  lower  bound  was  shown  in  two  different  senses: 
one  extending  the  parametric  pointwise  bound  to  an  “artifi¬ 
cial”  parameter  space  with  a  dimension  depending  on  the  sam¬ 
ple  size  ([5]),  and  the  other  in  the  minimax  sense  ([7]).  In  this 
paper,  we  review  these  lower  bounds  and  the  methods  used  to 
prove  these  lower  bounds.  Finally  we  provide  a  new  proof  for 
the  lower  bound  in  the  nonparametric  case.  This  new  proof 
is  information-theoretic,  bypassing  the  detour  to  accumulated 
prediction  errors,  although  we  do  borrow  calculations  from  the 
density  estimation  literature. 

II.  Rissanen’s  Lower  Bound  and  its 
Nonparametric  Extension 

For a given iid data string $x_1, x_2, \ldots, x_n$, and without
knowing the distribution $f$ which generated the data, we would like
to compress the data in an efficient way. When $f(x) = f(x \mid \theta)$
belongs to a smooth $k$-dimensional parametric family such
that the parameter $\theta$ can be estimated at the $n^{-1/2}$ rate,
Rissanen [4] showed that we need at least
$H(f) + \frac{k}{2} \frac{\log n}{n}$ bits
per data point, asymptotically. This lower bound holds in
expectation and it holds for almost all parameter values in
the parameter space. With a prefix code achieving this lower
bound, Rissanen justified that $\frac{k}{2} \frac{\log n}{n}$ can be viewed as the
coding complexity measure of the model class.

When $f$ is known to be in the smooth nonparametric
density class of bounded derivatives on [0,1], a complexity rate
measure of $n^{-2/3}$ was established in [5] by embedding the
nonparametric class in a parametric class of dimension of order
$n^{1/3} / \log n$. This embedding reflects the fact that a smooth
nonparametric class is in essence a parametric class.
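
The embedding dimension can be checked against the parametric
complexity. The following is only an order-of-magnitude sketch,
assuming Rissanen's per-symbol complexity $\frac{k}{2} \frac{\log n}{n}$
is applied with $k$ of order $n^{1/3} / \log n$ and that constants are
ignored:

```latex
\[
\frac{k}{2} \cdot \frac{\log n}{n} \,\bigg|_{k \,=\, n^{1/3} / \log n}
  = \frac{1}{2} \cdot \frac{n^{1/3}}{\log n} \cdot \frac{\log n}{n}
  = \frac{1}{2}\, n^{-2/3}.
\]
```

Under these assumptions the parametric bound evaluated at the
embedding dimension decays like $n^{-2/3}$, slower than the
$(\log n)/n$ decay obtained for any fixed $k$.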

III.  The  Minimax  Lower  Bounds 

Through the minimax theorem (cf. [3]), the minimax
expected redundancy over a parametric class is equivalent to the
maximum of the Bayes redundancy, which is the same as the
mutual information. Fortunately, for a given prior, the Bayes
code is the mixture density with respect to that prior, and the
first term in the expansion of the Bayes redundancy or mutual
information is obtained in [2] to be the Rissanen coding
complexity of $\frac{k}{2} \frac{\log n}{n}$. Hence this complexity measure also serves
as the minimax lower bound on expected redundancy.

For the nonparametric class mentioned above, the minimax
theorem still holds; therefore any Bayes redundancy or mutual
information provides a lower bound. However, no prior seems
to exist on the whole density class for which the Bayes
redundancy can be approximated analytically. On the other hand,
the expected redundancy is simply the accumulated prediction
or estimation error in terms of Kullback-Leibler divergence,
and techniques have long been developed to obtain minimax
lower bounds on density estimation errors in the
nonparametric case, cf. [6]. By lower bounding the divergence by the
Hellinger distance and borrowing Assouad’s technique, a
minimax rate lower bound of $n^{-2/3}$ was established in [7].

Note that in applying Assouad’s technique, one does not
calculate the Bayes estimation error over the whole class, but
only over a conveniently chosen hypercube sub-class, and the
Bayes estimation error over this sub-class provides a lower
bound on the minimax estimation error. It turns out that
this detour to accumulated prediction or estimation error is
not necessary since we can use the hypercube sub-class
directly with the redundancy. Using a result from the density
estimation literature ([1]), it can be shown that $n^{-2/3}$ gives
the rate in a lower bound. Thus we obtain a new proof for the
minimax rate lower bound in [7], and this new line of proof
applies to other smooth classes of densities.

Superficially, the proof in the parametric case has a
continuous flavor since it relies on nice continuous priors on the
whole parameter space, whereas the proof in the
nonparametric case has a discrete flavor because of the hypercube
sub-class it relies on. In essence, however, the former is also
discrete since the continuous prior can be replaced by a
discrete uniform prior sitting on a grid sub-set of the parametric
space, as long as the grid-size is of the order of or smaller than
$n^{-1/2}$. Note that the nearest neighbors on the hypercube also
have Hellinger distances of order $n^{-1/2}$, the rate at which $n$
iid data points can possibly distinguish two distributions.
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Abstract  —  A  condition  on  a  class  of  processes  guar¬ 
anteeing  that  the  weak  redundancy  rate  has  the  same 
asymptotic order of magnitude as the strong redundancy
rate will be discussed.

I.  Introduction 

In recent papers, examples were constructed showing that
there is no nice weak redundancy rate for various subclasses
of the class of ergodic processes [1, 2, 3]. In each of these it
was shown how to find a process in the class whose $n$-th order
redundancy was large; then it was shown how to make small
changes to produce a process whose $m$-th order redundancy
was large, for some $m > n$. A suitable passage to a limit
then produced the desired example. The two essential features
were the existence of processes with large redundancy in a
neighborhood of any member of the class and a completeness
property to ensure the existence of a limit. The purpose of
this talk is to formalize these two features, in order to clarify
the prior constructions and extend them.

II.  Notation  and  terminology. 

The (expected) redundancy of a prefix $n$-code $C_n$, relative
to a process $P$, is defined by

$R(C_n \mid P) = E(L_n \mid P) - H_n(P),$

where $E(L_n \mid P)$ is the expected code length and

$H_n(P) = - \sum_{a_1^n} P(a_1^n) \log P(a_1^n)$

is the $n$-th order entropy of $P$. The minimax expected
redundancy for a class $S$ of stationary processes with alphabet $A$ is
defined by

$R_n(S) = \min_{C_n} \max_{P \in S} R(C_n \mid P),$

where the minimum is over all binary prefix $n$-codes.
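
These definitions can be evaluated directly for a toy class. The
sketch below assumes i.i.d. binary sources, restricts the minimum to
a small hypothetical family of Shannon codes (lengths
$\lceil -\log_2 Q(a_1^n) \rceil$, which satisfy Kraft's inequality),
and so computes only an upper bound on the true minimax redundancy;
the function names are illustrative.

```python
import math
from itertools import product

def block_prob(x, p):
    """Probability of binary block x under an i.i.d. source with
    P(1) = p, where 0 < p < 1."""
    ones = sum(x)
    return p ** ones * (1 - p) ** (len(x) - ones)

def entropy_n(p, n):
    """n-th order entropy H_n(P) in bits."""
    return -sum(block_prob(x, p) * math.log2(block_prob(x, p))
                for x in product((0, 1), repeat=n))

def shannon_length(x, q):
    """Codeword length of x in the Shannon code built for Q(1) = q."""
    return math.ceil(-math.log2(block_prob(x, q)))

def redundancy(q, p, n):
    """R(C_n | P): expected length of the code built for q, under the
    source p, minus the n-th order entropy of p."""
    expected_len = sum(block_prob(x, p) * shannon_length(x, q)
                       for x in product((0, 1), repeat=n))
    return expected_len - entropy_n(p, n)

def minimax(code_params, source_params, n):
    """Smallest worst-case redundancy over the candidate codes,
    returned as (value, code parameter)."""
    return min((max(redundancy(q, p, n) for p in source_params), q)
               for q in code_params)

# Hypothetical class of two sources and three candidate codes.
value, best_q = minimax([0.2, 0.5, 0.8], [0.2, 0.8], 4)
```

Here the code built for the symmetric parameter 0.5 is selected,
since a code matched to one source is heavily penalized on the other.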

A sequence $\{C_n\}$ is called a prefix-code sequence if $C_n$ is
a binary prefix $n$-code, for each $n$. A nondecreasing function
$n \mapsto \rho(n)$ is called a strong rate for the class $S$ if there is a
constant $M$ such that $R_n(S) \le M \rho(n)$, $n \ge 1$, and if, for any
prefix-code sequence $\{C_n\}$ and any function $\phi(n) = o(\rho(n))$,
there is a member $P \in S$ such that $R(C_n \mid P) / \phi(n)$ is
unbounded.

A nondecreasing function $n \mapsto \rho(n)$ is called a weak rate
for the prefix-code sequence $\{C_n\}$ on the class $S$ if for each
$P \in S$ there is a finite number $M = M(P)$ such that

$R(C_n \mid P) \le M \rho(n), \quad n \ge 1,$  (1)

and if, for any prefix-code sequence $\{C_n\}$ and any
function $\phi(n) = o(\rho(n))$, there is a member $P \in S$ such that
$R(C_n \mid P) / \phi(n)$ is unbounded. (Note that weak rates allow the
constant $M$ to depend on $P$.)

^Partially supported by NSF grant DMS-9024240 and MTA-NSF
project 37.


Let $d(P, Q)$ be a metric on a class $S$ of stationary processes,
and let $N_\epsilon(P) = \{ Q \in S : d(P, Q) < \epsilon \}$ denote the
$\epsilon$-neighborhood of $P$. The metric space $(S, d)$ is called locally rich if
$N_\epsilon(P)$ and $S$ have the same strong rate, for every $P \in S$ and
$\epsilon > 0$. The following theorem will be proved.

III.  The  weak-rate/strong-rate  theorem. 

If $S$ has a locally rich, complete metric $d$ such that
$d$-convergence implies weak convergence and convergence in
entropy, then the weak and strong rates for $S$ are the same.

Proof: Select $P^{(i)} \in S$ such that $d(P^{(i+1)}, P^{(i)}) < \delta_i$ and
such that $R(C_{n(j)} \mid P^{(i)}) \ge M_j \phi(n(j))$, $j \le i$, where $M_i \to \infty$
and $\sum_i \delta_i < \infty$.


IV.  Summary  table. 


Class                    Strong rate         Weak rate
i.i.d.                   log n               log n
Markov (k)               log n               log n
Finite state (M)         log n               log n
Renewal                  n^(1/2)             n^(1/2)
M-Renewal (k)            n^((k+1)/(k+2))     n^((k+1)/(k+2))
Regenerative             n                   n
B-processes              n                   n
Ergodic processes        n                   n
U_k Markov (k)           n                   log n
Finite state             n                   log n
Renewal/finite           n^(1/2)             log n
M-Renewal (k)/finite     n^((k+1)/(k+2))     log n
Regenerative/finite      n                   log n
U_k M-Renewal (k)        n                   n/log n*

Class/finite = finitely many waiting times.
* = upper bound only.


Metrics: 

1. Markov (k): (k + 1)-order variational distance.

2. Renewal: $d(P, Q) = \sum_t |P(t) - Q(t)|$, where $P(t)$ =
probability that waiting time is $t$.

3. Regenerative:

$D(P, Q) = d(P, Q) + \sum_t \sum_{a_1^t} \ldots,$

where $P_t$ = $t$-order distribution given that waiting time
is $t$, and $d(P, Q)$ is the renewal distance.

4. B-processes: d-metric.

5. Ergodic processes: f-metric.
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Abstract  —  My  talk  is  a  survey  of  binary  tree- 
structured  methods  for  clustering  as  they  apply  to 
predictive,  pruned,  tree-structured  vector  quantiza¬ 
tion  (predictive  PTSVQ).  Much  of  the  material  con¬ 
cerns  applications  of  PTSVQ  to  the  lossy  coding  of 
digital  medical  images,  especially  CT  and  MR  chest 
scans.  There  is  a  brief  introduction  to  the  asymp¬ 
totic  properties  of  the  algorithms  and  to  the  attempt 
to  understand  variability  and  covariability  of  amino 
acids  in  the  V3  loop  region  of  HIV.  The  research  has 
been  collaborative  with  many  others  over  a  five  year 
period. 

The algorithms involve successively partitioning the range
of a set of pixel vectors $X$, and can be viewed as successive
two-means clustering. When $X$ takes values in a Euclidean
space, the partitioning is by hyperplanes; qualitative data can
be handled, too, as in the application to HIV amino acid
sequences. Results are summarized by a binary tree; a pixel
vector $X^*$ to be coded is passed from the root node
successively to a terminal node ($t$).
The codeword assigned is simply a suitably defined centroid
of learning sample $X$ values at $t$. The bit rate is the average
depth of the tree. Splitting is always “greedy,” in senses to be
described. We grow an initial tree larger than we intend to use
and prune it back to smaller ones. Every subtree of the large
initial tree has its own figure of merit and assigned “penalty”
for complexity. For a given penalty, there is a unique smallest
pruned subtree of the cited initial large tree that is optimal
in terms of figure of merit. As the penalty increases, the
sequence of optimally pruned subtrees is nested. Codes that
correspond to the sequence of optimally pruned subtrees are
thus seen to have a natural progressive property.
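
The growing stage can be sketched as follows. This is a minimal
illustration under stated assumptions, not the PTSVQ implementation
of the talk: it uses squared Euclidean distortion, a deterministic
two-means split seeded by the lexicographically smallest and largest
points, splits the leaf with the largest distortion decrease, and
omits prediction and pruning. All names are hypothetical.

```python
def centroid(points):
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distortion(points):
    c = centroid(points)
    return sum(sqdist(p, c) for p in points)

def two_means(points, iters=20):
    """Lloyd iterations with two centers; either returned part may be
    empty if the points cannot be separated."""
    c0, c1 = min(points), max(points)
    for _ in range(iters):
        left = [p for p in points if sqdist(p, c0) <= sqdist(p, c1)]
        right = [p for p in points if sqdist(p, c0) > sqdist(p, c1)]
        if not left or not right:
            break
        c0, c1 = centroid(left), centroid(right)
    return left, right

def make_leaf(points):
    return {"centroid": centroid(points), "points": points, "children": None}

def grow(points, n_leaves):
    """Greedy growth: repeatedly split the leaf whose two-means split
    reduces total distortion the most."""
    root = make_leaf(points)
    leaves = [root]
    while len(leaves) < n_leaves:
        best = None
        for leaf in leaves:
            left, right = two_means(leaf["points"])
            if not left or not right:
                continue
            gain = distortion(leaf["points"]) - distortion(left) - distortion(right)
            if best is None or gain > best[0]:
                best = (gain, leaf, left, right)
        if best is None:
            break
        _, leaf, left, right = best
        leaf["children"] = (make_leaf(left), make_leaf(right))
        leaves = [l for l in leaves if l is not leaf] + list(leaf["children"])
    return root

def encode(root, x):
    """Pass x from the root to a terminal node; return the path bits
    and the centroid used as codeword."""
    node, bits = root, ""
    while node["children"] is not None:
        left, right = node["children"]
        if sqdist(x, left["centroid"]) <= sqdist(x, right["centroid"]):
            node, bits = left, bits + "0"
        else:
            node, bits = right, bits + "1"
    return bits, node["centroid"]

# Hypothetical one-dimensional learning sample.
sample = [(0.0,), (1.0,), (10.0,), (11.0,)]
tree2 = grow(sample, 2)
tree4 = grow(sample, 4)
```

The number of bits spent on a vector equals the depth of the terminal
node it reaches, so the bit rate is the average depth, as in the text.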

Versions of these algorithms can be shown to be
“consistent” [4] in a sense to be described. Part of the argument
involves showing that the algorithm terminates when it is
applied with a bit rate constraint to a fixed absolutely
continuous distribution with compact support. Next, a continuity
property is established relative to a fixed, convergent sequence
of distributions. Finally, aspects of empirical processes are
brought to bear upon the large sample behavior of the
algorithm when the learning sample is, beyond the cited
assumptions, stationary and ergodic.

Applications  of  PTSVQ  to  problems  of  data  compression 
in  digital  radiography  [1-3,  6]  are  reported.  One  set  of  prob¬ 
lems  involved  the  detection  of  lung  lesions  and  mediastinal 
adenopathy  from  12  bit  per  pixel  (bpp)  original  CT  images. 
The  pixel  intensities  coded  were  not  those  of  the  original  im¬ 
ages,  but  rather  of  the  residuals  when  pixel  blocks  are  pre¬ 
dicted  by  a  simple  Wiener-Hopf  technique  from  previously 
encoded  blocks.  (See  [5]  for  improvements  that  involve  seg¬ 
mentation,  increasing  the  memory  of  the  predictor,  and  ridge 
regression.)  Thirty  images  of  each  type  were  compressed  to 
six  different  levels.  Three  radiologists  then  used  the  original 
and compressed images for diagnosis. We quantify outcomes
by sensitivity (the chance an object is detected given that it
is there) and predictive value positive (the chance that a
detected object is actually there). Presumably larger bit rates
are  better,  though  the  data  do  not  bear  this  out  for  bit  rates 
more  than  2  bpp.  Plots  of  outcome  versus  bit  rate  are  fit 
by  quadratic  splines  with  a  single  knot  and  surrounded  by 
bootstrap-based  simultaneous  confidence  regions.  On  the  ba¬ 
sis  of  various  analyses  of  the  data  we  conclude  that  images 
can  be  compressed  to  bit  rates  between  one  and  two  bits  per 
pixel  without  significant  loss  of  diagnostic  accuracy.  From  a 
different  clinical  study  we  have  concluded  that  MR  chest  scans 
originally  9  bpp  and  used  for  measuring  vessels  in  the  chest 
can  be  compressed  to  .55  bpp  without  apparent  loss  of  clinical 
accuracy  [6].  Radiologists  seem  to  like  somewhat  compressed 
images  better  than  they  like  the  originals  [3]. 
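
The two outcome measures used in this study can be computed directly from detection counts. The sketch below is only an illustration: the function names and the counts are hypothetical and are not data from the study.

```python
def sensitivity(true_positives, false_negatives):
    """Chance that an object is detected given that it is there."""
    return true_positives / (true_positives + false_negatives)


def predictive_value_positive(true_positives, false_positives):
    """Chance that a detected object is actually there."""
    return true_positives / (true_positives + false_positives)


# Hypothetical counts for one reader at one bit rate (not study data).
tp, fn, fp = 45, 5, 15
print(sensitivity(tp, fn))                # 0.9
print(predictive_value_positive(tp, fp))  # 0.75
```

Plotting these two quantities against bit rate gives the kind of outcome curves described above.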

Abstract - We establish the Bayes risk consistency
of an unsupervised greedy-growing algorithm
that produces tree-structured classifiers from labeled
training vectors. The algorithm employs a composite
splitting criterion equal to a weighted sum of Bayes
risk and Euclidean distortion.

I.  Introduction 

Binary  trees  play  an  important  role  in  the  methodology 
of  Statistics  and  Information  Theory.  Classification  trees  are 
used in a wide variety of statistical problems; tree-structured
vector  quantizers  provide  an  efficient  and  effective  means  of 
compressing  images. 

A  critical  problem  in  practice  is  how  to  design  a  good  tree- 
structured  classifier  or  quantizer  from  a  finite  data  set.  Greedy 
growing  algorithms  [1,2,3]  produce  suitable  trees  one  node  at  a 
time,  optimizing  a  specified  splitting  criterion  at  each  step.  In 
spite  of  their  empirical  success,  there  has  been  little  theory  to 
support  the  unsupervised  use  of  greedy  growing  algorithms,  or 
to  examine  the  behavior  of  such  algorithms  on  large  training 
sets. 

We  establish  the  Bayes  risk  consistency  of  an  unsupervised 
variant  of  the  CART  [2]  algorithm.  The  algorithm,  which  em¬ 
ploys  a  composite  splitting  criterion  equal  to  a  weighted  sum 
of  Bayes  risk  and  Euclidean  distortion,  is  motivated  by  re¬ 
cent  work  [1]  on  the  design  of  joint  quantization/classification 
schemes.  Variance  of  the  classifiers  is  controlled  by  limiting 
the  number  of  splits,  rather  than  by  pruning  an  ‘overgrown’ 
tree. 

II.  Definitions 

A tree-structured partition is described by a pair $(T, \alpha)$
where $T$ is a binary tree and $\alpha : T \to \mathbb{R}^d$ assigns a
splitting vector to every node of $T$. Let $\tilde{T}$ denote the
terminal nodes of $T$. Each vector $x \in \mathbb{R}^d$ is associated
with a member of $\tilde{T}$ through a sequence of binary comparisons
that trace a path through $T$: beginning at the root, and at each
subsequent internal node, $x$ moves to that child of the current node
whose label is nearest to $x$ in Euclidean distance. (A tie-breaking
scheme may be used to avoid ambiguities.) Let $V_t$ be the set
of vectors $x$ whose path contains the node $t$. Then each $V_t$
is a convex polytope, and the collection $\{V_t : t \in \tilde{T}\}$
is a partition of $\mathbb{R}^d$.
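
The routing rule just described can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the `Node` class, the `route` function, and the tie-breaking rule (ties go to the left child) are assumptions made for the example.

```python
import numpy as np


class Node:
    """A tree node carrying its splitting vector alpha(t) (hypothetical layout)."""

    def __init__(self, label, left=None, right=None):
        self.label = np.asarray(label, dtype=float)
        self.left = left
        self.right = right

    def is_terminal(self):
        return self.left is None and self.right is None


def route(root, x):
    """Follow x from the root to a terminal node by nearest child label."""
    x = np.asarray(x, dtype=float)
    node = root
    while not node.is_terminal():
        d_left = np.linalg.norm(x - node.left.label)
        d_right = np.linalg.norm(x - node.right.label)
        # Assumed tie-breaking scheme: ties go to the left child.
        node = node.left if d_left <= d_right else node.right
    return node


# Small example tree with three terminal nodes.
leaf_a = Node([0.0, 0.0])
leaf_b = Node([4.0, -1.0])
leaf_c = Node([4.0, 3.0])
root = Node([0.0, 0.0], left=leaf_a, right=Node([4.0, 0.0], leaf_b, leaf_c))
print(route(root, [3.0, 2.0]) is leaf_c)  # True
```

Each comparison is a test against the perpendicular bisector of the two child labels, which is why the resulting cells are convex polytopes.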

Let $(X, Y)$ be jointly distributed random variables with
$X \in \mathbb{R}^d$, $Y \in \{0,1\}$. A tree-structured classifier-quantizer
(TSCQ) is described by a four-tuple $(T, \alpha, \beta, \gamma)$, where
$(T, \alpha)$ is a tree-structured partition as above,
$\beta : \tilde{T} \to \mathbb{R}^d$ assigns a vector representative to
each $t \in \tilde{T}$, and $\gamma : \tilde{T} \to \{0,1\}$ assigns a
class representative to each $t \in \tilde{T}$. (The four-tuple
above will be abbreviated by $T$.) The triple $(T, \alpha, \beta)$ defines
a tree-structured vector quantizer
$Q_T(x) = \sum_{t \in \tilde{T}} \beta(t) I\{x \in V_t\}$
by assigning a vector representative to each element of the
partition $\{V_t : t \in \tilde{T}\}$. Similarly $(T, \alpha, \gamma)$
defines a tree-structured classification rule
$C_T(x) = \sum_{t \in \tilde{T}} \gamma(t) I\{x \in V_t\}$.
Let $D(T) = E\|X - Q_T(X)\|^2$ be the distortion of $Q_T$ and
$R(T) = P\{C_T(X) \ne Y\}$ the Bayes risk of $C_T$. Following
[1], for $\lambda \in [0,1]$, we define the composite risk
$\Gamma_\lambda(T) = \lambda R(T) + (1 - \lambda) D(T)$.

¹Andrew Nobel is with the Department of Statistics, University
of North Carolina, Chapel Hill. He is currently on leave at the
Beckman Institute, University of Illinois, 405 N. Mathews, Urbana,
IL 61801.
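
As a rough illustration of the composite risk, the following sketch evaluates its empirical version on a small sample. It assumes that each sample has already been assigned to a terminal cell; the function name `composite_risk`, the argument layout, and the toy data are all hypothetical.

```python
import numpy as np


def composite_risk(X, Y, cells, beta, gamma, lam):
    """Empirical lam * R(T) + (1 - lam) * D(T).

    X: (n, d) vectors, Y: (n,) labels in {0, 1},
    cells: (n,) index of the terminal cell containing each vector,
    beta: (m, d) vector representatives, gamma: (m,) class representatives.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    cells = np.asarray(cells)
    beta = np.asarray(beta, dtype=float)
    gamma = np.asarray(gamma)
    # Distortion: mean squared Euclidean error of the quantizer.
    distortion = np.mean(np.sum((X - beta[cells]) ** 2, axis=1))
    # Risk: fraction of samples whose cell label disagrees with Y.
    risk = np.mean(gamma[cells] != Y)
    return lam * risk + (1.0 - lam) * distortion


X = [[0.0], [2.0], [10.0], [12.0]]
Y = [0, 0, 1, 0]
cells = [0, 0, 1, 1]
beta = [[1.0], [11.0]]
gamma = [0, 1]
print(composite_risk(X, Y, cells, beta, gamma, lam=0.5))  # 0.625
```

Here the distortion is 1.0 and the misclassification rate is 0.25, so equal weighting gives 0.625. Setting the weight to 1 recovers pure classification risk and setting it to 0 recovers pure distortion.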

III.  Greedy  Growing 

Fix a TSCQ $T$ and let $t \in \tilde{T}$. For each hyperplane $L$ that
intersects $V_t^{\circ}$ we may define an augmented tree
$\hat{T} = \hat{T}(t, L)$ as follows: add children $t_1, t_2$ to $t$ and
select $\alpha(t_1), \alpha(t_2) \in V_t$ such that $L$ is their
perpendicular bisector; for $i = 1, 2$ let $\beta(t_i)$ be the Euclidean
centroid of $V_{t_i}$; and let
$\gamma(t_i) = \arg\min_{\delta} P\{Y \ne \delta \mid X \in V_{t_i}\}$.
In this way $D(\hat{T}) \le D(T)$ and $R(\hat{T}) \le R(T)$, so that
$\Gamma_\lambda(\hat{T}) \le \Gamma_\lambda(T)$.

A training sequence $S_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ consists
of $n$ independent replicas of $(X, Y)$. Given $S_n$ and an
iteration count $k_n$ the greedy growing algorithm produces a
nested sequence $T_0 \le T_1 \le \cdots \le T_{k_n}$ of TSCQ's. The
initial tree $T_0$ consists of a single root node $t_0$ with
$\alpha(t_0)$ arbitrary, $\beta(t_0) = \frac{1}{n} \sum_i X_i$, and
$\gamma(t_0)$ the majority vote among $\{Y_i : 1 \le i \le n\}$. Given
$T_r$, the algorithm selects a terminal node $t^* \in \tilde{T}_r$ and a
hyperplane $L^*$ to minimize $\Gamma_\lambda(\hat{T}_r(t, L))$,
and then sets $T_{r+1} = \hat{T}_r(t^*, L^*)$. All quantities are computed
with respect to the empirical distribution of $S_n$. The output
of the algorithm is $T_{k_n}$.

IV.  Results 

Let $R^* = \inf P\{C(X) \ne Y\}$, where the infimum is taken
over all classification rules $C : \mathbb{R}^d \to \{0, 1\}$. Unpruned
classification trees produced by greedy growing are Bayes risk
consistent if Euclidean distortion is given a non-zero weighting in
the composite risk.

Theorem 1 Let $\lambda < 1$ and suppose that the marginal distribution
of $X$ has a density such that $E\|X\|^2 < \infty$. For each
$n \ge 1$ let $T_n$ be produced by applying the greedy growing
algorithm to the training sequence $S_n$ for $k_n$ steps. If
(i) $k_n \to \infty$ and (ii) $n^{-1} k_n \log n \to 0$ then
$R(T_n) \to R^*$ with probability one as $n \to \infty$.

Abstract - A cluster-based probability model has
been found to perform extremely well at capturing
the complex structures in natural textures (e.g., better
than Markov random field models). Its success is
mainly due to its ability to handle high dimensionality,
via large conditioning neighborhoods over multiple
scales, and to generalize salient characteristics
from limited training data. Imposing a tree structure
on this model provides not only the benefit of reducing
computational complexity, but also a new benefit:
the trees are mutable, allowing us to mix and match
models for different sources. This flexibility is of
increasing importance in emerging applications such as
database retrieval for sound, image and video.

Abstract - A VQ-based method is developed as an effective
data reduction technique for nonparametric classifier design.
This new technique, while insisting on competitive classification
accuracy, is found to overcome the usual disadvantage of traditional
nonparametric classifiers of being computationally complex and of
requiring large amounts of computer storage.

I. Introduction

A solution to the excessive complexity problem of traditional
nonparametric classifiers is to reduce the size of the design set
while insisting that the classifiers built upon the reduced design
set should perform as well, or nearly as well, as the
classifiers built upon the original design set. This idea has
been explicitly explored in the development of many classifier
design algorithms using reduced sample sets. However,
for very large design sets, these methods are often tedious
and difficult to implement, and the final reduction rate is
usually low [1].

We introduce a new approach for nonparametric data reduction
using the vector quantization technique. Combining
vector quantization with the classical Parzen kernel and
the kNN approaches, we develop two new algorithms for reduced
nonparametric classifier design, which we shall denote
the VQ-kernel and the VQ-kNN methods.

II. Development of VQ-based
Nonparametric Classifiers

The philosophy guiding the development of most traditional
nonparametric classification methods is that of using the
statistical information contained in a set of pre-classified
samples (or design set) for finding a good approximation
of the actual underlying probability density function, $p(x)$.
Then the classifier is built by applying the Bayesian rule.
However, for achieving high classification performance, this
approximation to $p(x)$, while it is obviously sufficient, is not
necessary. For example, any good approximation to $[p(x)]^{\alpha}$,
where the constant $\alpha > 0$, will achieve the same Bayesian
classifier as that achieved by approximating $p(x)$ itself.

In [2], Gersho shows that for an optimal quantizer, in the
asymptotic situation where the number of quantizer levels is
sufficiently large, the density function of the reproduction vectors
will closely approximate a continuous density function $\lambda(x)$
which is proportional to $[p(x)]^{\alpha}$, where $\alpha$ is a constant
determined only by the dimension and the distance measure. This, along
with our argument at the beginning of this section, strongly
indicates that the reproduction alphabet in an optimal quantizer
could be used as an effective design set for building
classifiers.

The VQ-kernel Classifier: We propose that vector quantization
be first applied to the original design set of each class.
The reproduction alphabets of the resultant optimal quantizers
are then retained as the reduced design sets. Then the
kernel method is applied as usual except that the reduced
design sets are used.

The VQ-kNN Classifier: Similar to the above VQ-kernel
classifier except a kNN classifier is built with the reduced
design sets.
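
A minimal sketch of the VQ-kNN idea follows. It assumes a plain Lloyd (k-means style) iteration as the quantizer design step, which may differ from the design algorithm the authors used; the function names, codebook size, and synthetic data are hypothetical.

```python
import numpy as np


def lloyd_codebook(samples, size, iters=20, seed=0):
    """Design a codebook for one class with a basic Lloyd iteration."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    start = rng.choice(len(samples), size=size, replace=False)
    codebook = samples[start].copy()
    for _ in range(iters):
        # Squared distances from every sample to every codeword.
        d = ((samples[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        for j in range(size):
            members = samples[nearest == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    return codebook


def vq_knn_fit(design_sets, size):
    """Replace each class's design set by its codebook (the reduced set)."""
    points, labels = [], []
    for label, samples in design_sets.items():
        codebook = lloyd_codebook(samples, size)
        points.append(codebook)
        labels.extend([label] * len(codebook))
    return np.vstack(points), np.array(labels)


def vq_knn_predict(points, labels, x, k=1):
    """Classify x by majority vote among its k nearest codewords."""
    d = ((points - np.asarray(x, dtype=float)) ** 2).sum(axis=1)
    nearest = labels[np.argsort(d)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]


# Synthetic two-class design sets (illustrative only).
rng = np.random.default_rng(1)
design = {
    0: rng.normal(0.0, 1.0, size=(200, 2)),
    1: rng.normal(10.0, 1.0, size=(200, 2)),
}
points, labels = vq_knn_fit(design, size=8)
print(points.shape)  # (16, 2): 400 design vectors reduced to 16 codewords
```

The classifier then stores and searches only the codewords, which is where the savings in computation and storage come from.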

IV. Simulations and Conclusions

By simulating with various data distributions, the performance
of our VQ-based methods is compared with that
of the traditional reduction algorithms including the CNN,
RNN, ENN, ECNN, as well as Fukunaga's reduced Parzen
[1]. Fig. 1 shows error rates for real speech data.


Fig.  1.  Classification  error  rates  of  VQ-based 
classifiers  and  traditional  classifiers  for  speech  data. 


It is found that: 1) the VQ-kernel classifier outperforms,
in terms of accuracy, all other data reduction algorithms at
all the reduction levels; 2) the VQ-based methods usually
achieve tens of times higher reduction rates while giving the
same level of classification accuracy, which usually means
a drastic reduction in complexity and storage; 3) finding
the reduced design set in the VQ-based methods is tens
or even hundreds of times faster than in other proposed data
reduction algorithms [1].

The  VQ-based  classifier  design  technique  is  also  extended  to 
the  design  of  histogram-based  classifiers  [3]. 

References 

[1] Q. Xie, C. A. Laszlo, and R. K. Ward, “Vector quantization
technique for nonparametric classifier design,”
IEEE Trans. on Pattern Anal. Machine Intell., vol. 15,
pp. 1326-30, Dec. 1993.

[2] A. Gersho, “Asymptotically optimal block quantization,”
IEEE Trans. on Inform. Theory, vol. IT-25,
pp. 373-80, July 1979.

[3] Q. Xie, R. K. Ward, and C. A. Laszlo, “Multidimensional
histogram classifier design by using vector quantization,”
in Proceedings of IEEE Pacific Rim Conf. on
Comm., Comp. Signal Processing, vol. 1, (Vancouver),
pp. 39-42, Sept. 1993.


Tree-Based  Models  for  Speech  and  Language 

Michael  D.  Riley 

AT&T  Bell  Laboratories,  600  Mountain  Ave.,  Murray  Hill,  NJ  07974 


Abstract - Several applications of statistical tree-based
modelling to problems in speech and language are described,
including prediction of possible phonetic realizations,
segment duration modelling in speech synthesis, and
end of sentence detection in text analysis.

I.  INTRODUCTION 

Classification  and  regression  trees  [1]  are  well  suited  to  many 
of  the  pattern  recognition  problems  encountered  in  speech  and 
language since they (1) statistically select the most significant
features involved, (2) permit both categorical and continuous
factors  to  be  considered,  (3)  provide  “honest”  estimates  of 
their  performance,  and  (4)  allow  human  interpretation  and 
exploration  of  their  result.  Below  we  describe  several  appli¬ 
cations  of  these  methods  to  speech  and  language  processing. 

II.  PREDICTION  OF  POSSIBLE  PHONETIC 
REALIZATIONS 

A  lattice  of  possible  close  phonetic  transcriptions  given  a 
phonemic  transcription  (from  the  orthography  and  a  dictio¬ 
nary) is produced using a 6000 sentence, multispeaker transcribed
database as input. The resulting phonetic network
predicts  the  correct  pronunciation  of  a  phoneme  on  test  data 
from  the  same  corpus  83%  of  the  time,  contains  the  correct 
phone  in  the  top  5  guesses  99%  of  the  time,  and  has  a  con¬ 
ditional  entropy  of  .8  bits.  This  compares  to  the  null  model, 
in  which  only  the  phoneme  to  realize  is  used,  that  predicts 
the  correct  phone  69%  of  the  time,  contains  the  correct  phone 
in  the  top  10  guesses  99%  of  the  time,  and  has  a  conditional 
entropy  of  1.5  bits.  [2] 
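
For readers unfamiliar with the conditional entropy figures quoted above, the sketch below shows how such a quantity can be estimated from (context, phone) counts. The function name and the toy data are hypothetical and unrelated to the database used in the paper.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Estimate H(phone | context) in bits from (context, phone) pairs."""
    joint = Counter(pairs)
    context = Counter(c for c, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (c, _phone), count in joint.items():
        # p(context, phone) * log2 p(phone | context)
        h -= (count / n) * math.log2(count / context[c])
    return h


# Toy data: phoneme /t/ realized as [t] three times and as a flap once;
# phoneme /k/ always realized as [k].
pairs = [("t", "t")] * 3 + [("t", "dx")] + [("k", "k")] * 4
print(round(conditional_entropy(pairs), 3))  # 0.406
```

A lower value means that the realized phone is more predictable from the conditioning information, which is the sense in which 0.8 bits improves on 1.5 bits.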

III.  SEGMENT  DURATION  MODELLING  IN  SPEECH 
SYNTHESIS 

400  utterances  from  a  single  speaker  and  4000  utterances 
from  400  speakers  of  American  English  were  used  to  build 
optimal decision trees that predict segment durations. Over
70% of the durational variance for the single speaker and over
60%  for  the  multiple  speakers  were  accounted  for  by  this 
method  when  using  information  only  at  the  word  level  and 
below.  These  trees  were  used  to  derive  durations  for  a  text-to- 
speech  synthesizer  and  were  found  to  give  results  comparable 
to  the  existing  heuristically  derived  duration  rules.  Since  tree 
building  and  evaluation  is  rapid  once  the  data  are  collected  and 
the  candidate  features  specified,  the  technique  can  be  readily 
applied  to  other  feature  sets  and  to  other  languages.  [3] 


IV. END OF SENTENCE DETECTION

The not-so-simple problem of deciding when a period in text
corresponds to the end of a declarative sentence (and not an
abbrev.) is attempted with trees using the Brown corpus as
input. The result is 99.8% correct classification. The many
special cases required to solve this problem well nicely show
the value of the tree approach here. The majority of the errors
are due to difficult cases, e.g. a sentence that ends with “Mrs.”
or begins with a numeral [4].

V.  DISCUSSION 

On the whole, we have found classification and regression
trees quite useful in modelling a variety of phenomena in
speech and language. In part, it is their ability to handle both
categorical and continuous inputs and outputs that makes them
attractive to us. The fact that they offer efficient algorithms, a
well-established cross-validation procedure, and a relatively
perspicuous representation makes them more appealing to us
than, say, back-propagation neural networks for the problems
we have described.

The  principal  difficulty  we  have  found  with  this  and  similar 
statistical  approaches  is  that  while  the  trees  classify  well  most 
of  the  time,  they  occasionally  make  egregious  errors.  When 
noticed,  it  is  possible  to  correct  these  errors  by  hand  modifi¬ 
cation  of  the  trees.  This  is,  however,  quite  tedious.  Further,  if 
new  data  are  used  or  new  input  features  are  tried,  the  editing 
has  to  be  redone  (if  the  error  remains). 

What  would  be  most  appealing  to  us  would  be  techniques 
that  would  allow  easy  mixing  of  statistical  learning  with  hand 
specification. The user could hand specify what he is sure
of and leave it to the statistics to fill in the rest as best it can,
letting us have our cake and eat it too.

Image compression is often approached from an angle of
statistical  image  classification.  For  instance,  VQ-based  im¬ 
age  coding  methods  compress  image  data  by  classifying  im¬ 
age  blocks  into  representative  two-dimensional  patterns  (code¬ 
words)  that  statistically  approximate  the  original  data.  An¬ 
other  image  compression  approach  that  naturally  relates  to 
image  classification  is  segmentation-based  image  coding  (SIC). 
In  SIC,  we  classify  pixels  into  segments  of  certain  uniformity 
or  similarity,  and  then  encode  the  segmentation  geometry  and 
the  attributes  of  the  segments. 

Image  segmentation  in  SIC  has  to  meet  some  more  strin¬ 
gent  requirements  than  in  other  applications  such  as  computer 
vision  and  pattern  recognition.  Firstly,  the  segmentation  de¬ 
scription  must  be  compact  to  ensure  low  bit  rate.  Secondly, 
the  classification  criterion  should  quantify  visual  differentia¬ 
tion  of  image  patterns.  Thirdly,  the  segmentation  process  has 
to  be  fast  enough  to  suit  image/video  coding  purposes. 

An  efficient  SIC  coder  has  to  strike  a  good  balance  between 
accurate  semantics  and  succinct  syntax  of  the  segmentation. 
From  a  pure  classification  point  of  view,  free  form  segmen¬ 
tation  by  relaxation,  region-growing,  or  split-and-merge  tech¬ 
niques  offers  an  accurate  boundary  representation.  But  the 
resulting  segmentation  geometry  is  often  too  complex  to  have 
a  compact  description,  defeating  the  purpose  of  image  com¬ 
pression.  Instead,  we  adopt  a  bintree-structured  segmenta¬ 
tion  scheme.  The  bintree  is  a  binary  tree  created  by  recursive 
rectilinear  bipartition  of  an  image.  The  bintree-structured 
segmentation  is  semantically  more  flexible  than  the  popular 
quadtree,  and  yet  it  has  as  simple  syntax  as  the  quadtree. 
This  nice  property  translates  to  compression  gains. 

Large and smooth surfaces of a natural scene correspond to
regions of fairly continuous intensities in its digital image. In
these regions pixel values can be fit well by a low-order polynomial,
and the least-square piecewise functional approximation
yields a more compact image description than DCT and VQ.
Luckily, the majority of the areas of a natural image fall into this
category. This is demonstrated by the facts that most codewords
in a VQ codebook form smooth shading patterns, and
most DCT blocks have dominant low frequency coefficients.
The main advantage of SIC over VQ and DCT is that it
can adaptively fit an image with more flexible segments than
fixed blocks, resulting in fewer segments and hence a shorter
description. However, in the areas of textures or edges, least-square
piecewise fitting breaks down since higher order terms induce
more real coefficients to be quantized and coded. In comparison,
the VQ technique is more suitable for classifying and
compressing textures. Thus within a SIC framework, least-square
approximation and VQ can complement each other for higher
compression than either method alone. This necessitates the
classification of an image into smooth and texture regions.

Many  possible  classifiers  can  be  used  to  decide  whether  a 
bintree  block  is  smooth  or  contains  textures  or  edges.  Since  we 
use  least-square  approximation  to  code  smooth  regions,  it  is 
convenient to base the classifier on variance. We fit pixel values
ues  by  a  low-order  polynomial  to  form  a  largest  bintree  block 
possible  under  a  given  error  tolerance.  The  size  of  the  bintree 
block  serves  as  the  classifier.  If  the  size  exceeds  a  threshold, 
we  hypothesize  that  the  intensity  function  is  smooth  in  that 
area,  and  consequently  code  the  block  with  quantized  poly¬ 
nomial  coefficients.  Otherwise,  we  hypothesize  that  the  block 
contains  rich  textures.  An  additional  VQ  texture  coding  is 
employed  on  the  block  to  get  a  better  approximation.  Using 
bintree  block  size  as  the  classifier  means  that  no  side  informa¬ 
tion  is  required  to  identify  the  type  of  the  segment. 

To  design  a  segmentation  algorithm,  we  can  either  split 
the  image  top-down  or  merge  primitive  blocks  bottom-up. 
But  there  are  two  advantages  to  the  bottom-up  merge  ap¬ 
proach.  By  merging  smaller  blocks,  we  do  not  unnecessarily 
solve  the  least-squares  problem  in  large  bintree  blocks  which 
cannot  possibly  be  leaf  nodes,  reducing  algorithm  complexity. 
Also,  by  examining  smaller  blocks  first,  we  avoid  segment  mis- 
classification  due  to  smoothing  of  prominent  local  textures  by 
least-square  fitting. 

A main result of this research is a texture code based on
binary VQ. Let $f(x,y)$ be the input image and $g(x,y)$ be the
least-square linear approximation of $f(x,y)$ in a texture block.
We model the texture to be $e(x,y) = f(x,y) - g(x,y)$. DCT
or VQ can be used to encode $e(x,y)$. But a lower bit rate can be
achieved for the same transparent image quality by taking
advantage of the fact that in high texture areas the human visual
system is less sensitive to intensity resolution. Therefore, we
coarsely quantize the amplitude of $e(x,y)$ into only two levels,
and map $e(x,y)$ to a binary texture pattern $m(x,y)$, where
$m(x,y) = 0$ if $e(x,y) \ge 0$ and $m(x,y) = 1$ if $e(x,y) < 0$.
Depending on $m(x,y) = 0$ or $m(x,y) = 1$, $e(x,y)$ is quantized
to $\sigma$ or $-\sigma$ respectively, where $\sigma$ is the standard
deviation of $e(x,y)$. Note that $e(x,y)$ is zero mean since it is the
residual function of a least-square linear approximation. Consequently,
this simple bi-level quantization preserves both mean and variance of the
original image just like in block truncation coding.
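
The bi-level texture quantization can be sketched as follows, under the assumption that the least-square linear approximation is a plane a + b x + c y fitted over the block. The function name `bilevel_texture` and the random test block are hypothetical, and the sign convention (non-negative residuals map to +sigma) is an assumption.

```python
import numpy as np


def bilevel_texture(block):
    """Plane fit plus two-level quantization of the residual.

    Returns (g, m, sigma, reconstruction).
    """
    block = np.asarray(block, dtype=float)
    h, w = block.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Design matrix for the plane a + b*x + c*y.
    A = np.column_stack([np.ones(h * w), xx.ravel(), yy.ravel()])
    coef, *_ = np.linalg.lstsq(A, block.ravel(), rcond=None)
    g = (A @ coef).reshape(h, w)
    e = block - g                    # residual texture, zero mean
    sigma = e.std()                  # standard deviation of the residual
    m = (e < 0).astype(np.uint8)     # binary texture pattern
    e_hat = np.where(m == 0, sigma, -sigma)
    return g, m, sigma, g + e_hat


rng = np.random.default_rng(0)
block = rng.normal(128.0, 20.0, size=(8, 8))
g, m, sigma, recon = bilevel_texture(block)
# The quantized residual has the same mean square as the original residual.
print(np.isclose(np.mean((recon - g) ** 2), sigma ** 2))  # True
```

In this sketch the residual energy is preserved exactly, while the mean of the quantized residual is exactly zero only when the block has equally many pixels on each side of the fitted plane. The pattern `m` costs one bit per pixel before any further compression.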

To  obtain  rates  lower  than  1  bit/pixel,  we  have  to  com¬ 
press  the  texture  pattern  m{x,y)  as  well.  This  is  a  problem 
of  texture  classification  which  can  be  solved  by  binary  VQ. 
But a direct use of the LBG algorithm often fails to produce
a satisfactory codebook, because there are too many local optima
in the sample space of binary vectors for a
gradient-descent optimization method to improve on an initial
codebook. To worsen the problem, the expected Hamming
distance previously used in binary VQ as the distortion
measure  does  not  proportionally  quantify  the  visual  quality 
degradation  caused  by  inverted  bits.  This  is  because  the  per¬ 
ceived  texture  reproduction  quality  is  affected  not  only  by  the 
average  error  but  also  by  burst  errors.  It  is  important  not  to 
invert  two  or  more  adjacent  bits.  We  discover  that  the  use 
of  linear  codes  for  binary  VQ  in  the  spirit  of  optimal  sphere 
covering  offers  remedies  to  both  problems  of  local  minimum 
trap  and  burst  error. 

Coding for Noisy Feasible Channels

Richard J. Lipton
Department of Computer Science
Princeton University
rjl@princeton.edu

Abstract:  We  prove  a  constructive  version  of  Shannon’s  Fundamental  Theorem  of  Information 
Theory.  The  new  theorem  holds  for  any  feasible  channel.  A  channel  is  feasible  provided  it  is 
computable  by  a  polynomial  time  computation. 

Our  main  result  is  a  new  constructive  proof  of  Shannon’s  Theorem.  Consider  a  feasible 
channel.  Then,  there  is  a  coding  method  C  with  the  following  properties: 

(1)  We  can  construct  C  in  polynomial  time. 

(2)  We  can  encode  any  message  in  polynomial  time. 

(3)  We  can  decode  any  message  in  polynomial  time. 

(4) The probability that the method makes an error goes to 0 at least as fast as

(5)  The  rate  of  the  method  can  be  as  close  to  the  capacity  of  the  channel  as  one  wishes. 

How do we construct these codes? Following Lipton [1] we restrict the channel to be “feasible”.
That is, we restrict the channel to only use random polynomial time to decide which bits to
change.  Essentially,  any  channel  is  characterized  by  two  parameters:  (i)  how  “mean”  it  is;  (ii) 
how  “smart”  it  is.  We  measure  how  mean  a  channel  is  by  how  many  bits  it  can  change.  We 
measure  how  smart  a  channel  is  by  how  much  computation  it  is  allowed  to  perform  to  decide 
which  bits  to  change.  Thus,  our  key  point  is:  only  allow  channels  with  smartness  bounded  by 
random  polynomial  time. 

We claim that “real” channels have their smartness limited in this way. This is an assertion
like “Church's Thesis” and of course cannot be “proved”. It is, we claim, a reasonable assumption.
Real channels are analog/digital systems. It certainly appears to be reasonable to assume
that such systems cannot do more computing than random polynomial time. If a real channel
existed that could do more than polynomial time computation, then perhaps we could use it to
solve intractable problems! Note that in classic information theory the most powerful kind of
channel considered is often a channel that is finite state. Of course this is in our class.

Coding for Distributed Computation

Extended communications among component
processors are essential to the operation of all but
the simplest computers. In this talk we are concerned
with the following question: if the communications
among processors, linked in some network,
are unreliable, what is the effect on the efficiency
and reliability with which the network can perform
a computation?

An  important  case  of  this  scenario,  that  in  which 
there  are  two  processors  and  the  required  task 
is  to  transmit  a  large  block  of  data  from  one  to 
the  other,  actually  predates  large-scale  computing. 
Shannon’s  coding  theorem  addresses  this  problem, 
and shows that in order to reliably transmit a message
of $T$ bits over a noisy communication channel
it suffices to send a message of length $T \frac{1}{C}$ (for
$0 < C \le 1$ the “Shannon capacity” of the channel).
The  theorem  ensures  that  the  probability  of  a  de¬ 
coding  error  is  exponentially  small  in  the  message 
length  T. 
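
To make the length blow-up concrete, the sketch below computes the capacity of a binary symmetric channel and the corresponding asymptotic transmission length T/C. The choice of a binary symmetric channel with crossover probability p, and the function names, are assumptions made for illustration; the figure is a limit for large T, not an exact code length.

```python
import math


def binary_entropy(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_capacity(p):
    """Capacity in bits per use of a binary symmetric channel."""
    return 1.0 - binary_entropy(p)


def coded_length(T, p):
    """Asymptotic number of channel uses for T message bits (requires C > 0)."""
    return T / bsc_capacity(p)


print(round(bsc_capacity(0.11), 3))      # 0.5
print(round(coded_length(1000, 0.11)))   # 2000
```

With a crossover probability of 0.11 the capacity is close to one half, so about twice as many channel uses as message bits are needed.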

We  will  describe  analogous  coding  theorems  for 
the  more  general,  interactive,  communications  re¬ 
quired  in  computation.  In  this  case  the  bits  trans¬ 
mitted  in  the  protocol  are  not  known  to  the  pro¬ 
cessors  in  advance  but  are  determined  dynamically. 
Therefore  the  block  encoding  technique  used  in  the 
proof  of  Shannon’s  theorem,  does  not  apply. 

First we show that any interactive protocol of
length $T$ between two processors connected by a
noiseless channel can be simulated, if the channel
is noisy (a binary symmetric channel of capacity
$C$), in time proportional to $T \frac{1}{C}$, and with error
probability exponentially small in $T$.

*Research supported by an NSF Mathematical Sciences
Postdoctoral Fellowship.

Then  we  show  that  this  result  can  be  extended  to 
arbitrary  distributed  network  protocols.  We  show 
that  any  distributed  protocol  which  runs  in  time  T 
on  a  network  of  degree  d  having  noiseless  commu¬ 
nication  channels,  can,  if  the  channels  are  in  fact 
noisy, be simulated on that network in time proportional
to $T \frac{1}{C} \log d$. The probability of failure of
the  protocol  is  exponentially  small  in  T. 

Preliminary  presentations  of  these  results  can  be 
found  in  [1,  2]. 

The network theorem is joint with Sridhar Rajagopalan.

Abstract - This is a tutorial survey of recent
information theoretic results dealing with the minimal
randomness necessary for the generation of random
processes with prescribed distributions.

I.  Introduction 

Shannon  Theory  explores  the  fundamental  limits  on  the  size 
of  codes  that  enable  the  reliable  reproduction  or  transmission 
of  information.  Reliability  is  typically  quantified  by  the  prob¬ 
ability  that  the  decoded  message  is  equal  to  the  original  one, 
or  by  some  measure  of  the  distance  between  the  original  and 
decoded  messages. 

In  this  paper  we  are  not  interested  in  the  reproduction 
or  transmission  of  information  but  rather  in  the  generation 
of  random  processes  with  prescribed  distributions,  and  asso¬ 
ciated  problems.  For  example,  we  may  want  to  simulate  a 
“real-world”  random  process,  or  the  response  of  a  system  to 
such  an  input.  Random  process  generation  is  accomplished  by 
adequately  mapping  a  source  of  pure  random  bits.  A  key  ques¬ 
tion  that  quantifies  the  “complexity”  of  the  random  process  is 
the minimal randomness of the source of pure bits necessary
to  accomplish  the  task.  As  in  conventional  Shannon  theory,  a 
rich  theory  arises  when  some  distance  (often  arbitrarily  small) 
is  allowed  between  the  desired  and  the  resulting  probability 
distributions. To this end, several distance measures have
been considered in the literature, such as variational distance,
divergence, $\bar{\rho}$-distance, etc.
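
As a small illustration of generating a prescribed distribution by mapping pure random bits, the Python sketch below rounds a target distribution to one whose probabilities are multiples of 2^-k, which is what a deterministic map of k fair bits can produce, and reports the variational distance (taken here as the sum of absolute differences, so it ranges over [0, 2]). The function names and the largest-remainder rounding rule are illustrative assumptions, not constructions taken from the cited references.

```python
from fractions import Fraction


def dyadic_approximation(target, k):
    """Round `target` to probabilities that are multiples of 2**-k.

    Any such distribution can be produced by a deterministic map of
    k fair bits: assign 2**k * q[i] of the bit strings to symbol i.
    """
    cells = 2 ** k
    counts = [int(p * cells) for p in target]  # round every mass down
    leftover = cells - sum(counts)
    # Hand the leftover bit strings to the largest fractional parts.
    by_remainder = sorted(range(len(target)),
                          key=lambda i: target[i] * cells - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return [Fraction(c, cells) for c in counts]


def variational_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


target = [Fraction(1, 3), Fraction(2, 3)]
for k in (1, 3, 5):
    approx = dyadic_approximation(target, k)
    print(k, approx, variational_distance(target, approx))
```

The distance shrinks as more random bits are spent, which is the trade-off that the resolvability results quantify asymptotically.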

II.  Source  Resolvability 

The  resolvability  of  a  source  is  defined  [1]  as  the  minimal 
number  of  random  bits  per  sample  it  takes  to  reproduce  the  n- 
dimensional  distributions  with  arbitrary  accuracy  as  n  tends 
to  infinity.  A  general  formula  is  shown  in  [1]  for  the  resolv¬ 
ability  of  a  source.  For  the  special  case  of  stationary  ergodic 
sources  and  variational  distance  it  is  equal  to  the  entropy 
rate. Reference [7] considers the problem of finite-precision
resolvability where the approximation distance need not be
arbitrarily small, and shows that for any information stable
source the D-resolvability is independent of D in the special case
of variational distance. However, with less stringent approximation
measures such as the Prohorov and $\bar{\rho}$-distance, the
D-resolvability is shown in [7] to be given by the rate distortion
function evaluated with a sample-path distortion metric
derived from the distribution distance measure.

III.  Channel  Resolvability 

In system simulation, the objective is to induce the same
output distributions as those that would obtain with a
“real-world” input. The channel (or system) resolvability is defined
as the minimal randomness required to generate any desired
input so that the output distributions are approximated with
arbitrary accuracy. Naturally, the more “random” a system is,
the lower its resolvability, as it does not pay to reproduce fine
details in the input distributions. It is shown in [1] that the
channel resolvability is equal to its capacity for most discrete
channels (those that satisfy the strong converse). The complementary
problem where the input is given but the channel
is to be simulated is studied in [5], where it is shown that
the minimal randomness required to simulate the system for
a specific input is equal to the conditional entropy rate of the
output given the input.

IV.  Intrinsic  Randomness 
A problem which is dual to source resolvability is the maximal
randomness rate that can be extracted from an arbitrary
source. The intrinsic randomness rate of a source is defined in
[9] as the largest rate of almost-fair coin flips that can be
extracted by a deterministic mapping of the source. For stationary
ergodic sources and variational distance the intrinsic randomness
rate is equal to the entropy rate [9]. However, there
are nonstationary sources for which the intrinsic randomness
rate is not equal to the minimal noiseless source coding rate.
The more general problem of finite precision intrinsic randomness
is studied in [10]. Using variational distance, [10] shows
that the finite precision intrinsic randomness rate is given,
essentially, by the inverse asymptotic distribution of the entropy
density.
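
To make the idea of extracting fair coin flips by a deterministic mapping concrete, here is a sketch of the classical von Neumann procedure for an i.i.d. biased binary source. It is only an illustration: it is a variable-length scheme rather than the mappings studied in [9] and [10], and its rate of p(1-p) output bits per source bit falls short of the entropy rate. All function names are hypothetical.

```python
import math


def von_neumann_extract(bits):
    """Map pairs 01 -> 0 and 10 -> 1, and discard pairs 00 and 11.

    For an i.i.d. source the two kept patterns are equally likely,
    so the output bits are exactly fair.
    """
    output = []
    for first, second in zip(bits[0::2], bits[1::2]):
        if first != second:
            output.append(first)
    return output


def von_neumann_rate(p):
    """Expected output bits per source bit when P(bit = 1) = p."""
    return p * (1.0 - p)


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


print(von_neumann_extract([0, 1, 1, 0, 1, 1, 0, 0, 1, 0]))
print(von_neumann_rate(0.3), binary_entropy(0.3))
```

The gap between the two printed rates shows how much randomness this simple mapping leaves unused compared with the entropy rate.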
I. Introduction and Definitions

Random number generators are important devices in randomized
algorithms, Monte-Carlo methods, and in simulation
studies of random systems. A random number generator is
usually modeled as a random source emitting independent,
equally likely random bits. In practice, the random source one
has at hand can deviate from this idealized model, and the random
number generator operates by applying a deterministic
mapping on the output of the (nonideal) random source. The
deterministic mapping is chosen so that the resulting process
approximates, in some sense, a sequence of independent,
equally likely random bits. A prime measure of the intrinsic
randomness of a given source X is the maximal rate at which
random bits can be extracted from X by suitably mapping
its output. This maximal rate depends on the statistics of the
source X and on the sense of approximation. In [1] it is shown
that the maximal rate at which arbitrarily accurate approximations
of pure random bits can be extracted from X equals
its inf entropy rate, $\underline{H}(X)$. The measures of accuracy with
respect to which this result was shown to hold are the variational
distance, the $\bar{d}$ distance and normalized divergence.

In  problems  like  randomized  algorithms,  or  Monte-Carlo 
simulations,  an  arbitrarily  accurate  approximation  of  pure 
random  bits  may  be  more  than  what  we  need,  and  a  con¬ 
trolled  deviation  from  pure  random  bits  can  be  tolerated.  In 
such  cases,  one  may  wish  to  increase  the  rate  of  generation  of 
random  bits  at  the  expense  of  a  coarser  approximation  of  the 
desired  fair  coin  flip  distributions.  In  this  work  we  study  the 
problem  of  finite-precision  random  bit  generation,  where  the 
accuracy  measure  is  the  variational  distance.  The  results  pre¬ 
sented  here  extend  part  of  the  results  in  [1]  and  also  provide 
a  nice  counterpart  to  the  finite-precision  source  resolvability 
problem  that  was  studied  in  detail  in  [2]. 

Throughout, X is a random source with finite alphabet A,
and logarithms have base 2. We start with a few definitions.

Definition 1 [1] R is a D-achievable intrinsic randomness
rate of X if there exists a sequence of deterministic mappings
$\varphi_n : A^n \to \{0,1\}^r$ such that for all $\gamma > 0$ and sufficiently
large n,

$$\frac{r}{n} > R - \gamma$$

and

$$d_v(\varphi_n(X^n), B^r) \le D + \gamma,$$

where $B^r$ stands for an equiprobable distribution over $\{0,1\}^r$
and $d_v(\cdot,\cdot)$ is the variational distance between distributions.

Definition 2 The finite-precision intrinsic randomness rate
of X is defined as the supremum of the D-achievable intrinsic
randomness rates of X and is denoted by $U_v(D, X)$.

Note that $U_v(2, X) = \infty$ for every source X. The next definition
deals with the relevant information theoretic function.

Definition 3 The variational inf rate-distortion function of
X, $\underline{R}_v(D)$, is defined as the supremum over all real numbers
h satisfying

$$\limsup_{n\to\infty} P\left[\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)} \le h\right] \le \frac{D}{2}.$$

Thus, $\underline{R}_v(D)$ is the largest real number h such that the mass
of the entropy density to the left of h does not exceed D/2,
asymptotically. Note that for every source $\underline{R}_v(0)$ equals the
inf entropy rate of the source, $\underline{H}(X)$, and $\underline{R}_v(2) = \infty$.
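
The verbal description above (the largest h such that the mass of the entropy density to the left of h does not exceed D/2) can be sketched numerically. The sketch assumes, purely for illustration, that the entropy density converges in distribution to a finite set of atoms, as it would for a mixture of a few ergodic components. The function name and the atom representation are hypothetical.

```python
import math


def inf_rate_distortion(spectrum, D):
    """Largest h whose left mass does not exceed D/2.

    `spectrum` is a list of (h, mass) atoms of the assumed limiting
    distribution of the entropy density; the masses should sum to 1.
    """
    if not 0 <= D <= 2:
        raise ValueError("D must lie in [0, 2]")
    budget = D / 2
    cumulative = 0.0
    for h, mass in sorted(spectrum):
        cumulative += mass
        # Every threshold above this atom has left mass `cumulative`,
        # so the supremum is reached at the first atom that overshoots.
        if cumulative > budget + 1e-12:
            return h
    return math.inf  # D = 2: no threshold is excluded


# Mixture of two components with entropy rates 0.5 and 1.0 bits.
mixed = [(0.5, 0.3), (1.0, 0.7)]
for D in (0.0, 0.4, 0.8, 2.0):
    print(D, inf_rate_distortion(mixed, D))
```

With a single atom the returned value does not depend on D below 2, while for the two-atom mixture the rate jumps once D/2 reaches the mass of the lower atom.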

II.  Results 

Theorem 1

$$U_v(D, X) = \underline{R}_v(D).$$

The next corollary is an easy consequence of Definition 3 and
Theorem 1: it implies that if X is information stable, one
cannot increase the asymptotic rate of production of random
bits by increasing their deviation (w.r.t. variational distance)
from ideal fair coin flips. This result has a nice counterpart in
the finite-precision source resolvability problem: it is shown
in [2] that if X is information stable, then its variational
finite-precision resolvability $S_v(D, X)$ is independent of D in the
region $0 < D < 2$.

Corollary 1 If X is information stable, then for $0 < D < 2$

$$\underline{R}_v(D) = U_v(D, X) = \underline{H}(X).$$

In  [2]  the  variational  finite-precision  source  resolvability  was 
characterized  as  the  infimum  of  the  sup  information  rate  over 
an  appropriate  class  of  channels  —  the  corresponding  sup  rate- 
distortion  function.  The  nice  duality  between  the  problems 
of  finite-precision  source  resolvability  and  finite-precision  bit 
generation,  and  Corollary  1,  leads  one  to  suspect  that  the 
variational  finite-precision  source  resolvability  (and  hence  also 
the  sup  rate-distortion  function)  as  defined  in  [2]  admits  a 
simpler  characterization  -  such  as  that  in  Definition  3.  This 
is  indeed  the  case,  as  one  can  see  from  the  following  theorem. 

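Theorem 2

$$S_v(D, X) = \bar{R}_v(D) = \inf\left\{h : \limsup_{n\to\infty} P\left[\frac{1}{n}\log\frac{1}{P_{X^n}(X^n)} > h\right] \le \frac{D}{2}\right\}.$$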

Thus,  the  variational  sup  rate-distortion  function  as  defined 
in  [2]  is  also  equal  to  the  smallest  real  number  h  such  that  the 
mass  of  the  entropy  density  to  the  right  of  h  does  not  exceed 
D/2,  asymptotically. 
I. Introduction

In this paper, a combined problem of source coding and identification
is considered. To put our problem in perspective,
let us first review the traditional problem in source coding
theory. Consider the following diagram, where $\{X_n\}_1^\infty$ is an
i.i.d. source taking values on a finite alphabet $\mathcal{X}$.

Figure 1: Model for source coding

The encoder output is a binary sequence which appears at a rate R bits
per symbol. The decoder output is a sequence $\{\hat{X}_n\}_1^\infty$ which
takes values on a finite reproduction alphabet $\mathcal{Y}$. In traditional
source coding theory, the decoder is required to be able to recover
$\{X_n\}_1^\infty$ completely or with some allowable distortion.
That is, the output $\{\hat{X}_n\}_1^\infty$ must satisfy

$$n^{-1}\sum_{i=1}^{n}\rho(X_i, \hat{X}_i) \le d \qquad (1)$$

for sufficiently large n, where $\rho : \mathcal{X}\times\mathcal{Y} \to [0,+\infty)$ is a
distortion measure and $d \ge 0$ is the allowable distortion. The
problem is then to determine the infimum of rate R such that
the system shown in Fig. 1 can operate in such a way that (1)
is satisfied. From rate distortion theory, this infimum is given
by the rate distortion function of the source $\{X_n\}_1^\infty$.

Let us now consider the system shown in Fig. 2.

Figure 2: Model for joint source coding and identification.

The sequence $\{Y_n\}_1^\infty$ is a sequence of i.i.d. random variables taking
values on $\mathcal{Y}$. Knowing $\{Y_n\}$, the decoder is now required to be
able to identify whether or not the distortion between $\{X_n\}$
and $\{Y_n\}$ is less than or equal to d in such a way that two
kinds of error probabilities satisfy some prescribed conditions.
The problem we are now interested in is still to determine the
infimum of rate R such that the system shown in Fig. 2 can
operate in this way.

II.  Formal  Formulation  of  Problem 

Let $\{(X_n, Y_n)\}_1^\infty$ be a sequence of independent drawings of
a pair (X, Y) of random variables taking values on $\mathcal{X}\times\mathcal{Y}$
with joint distribution $P_{XY}$. Fix $0 < d < E\rho(X, Y)$. An
nth-order identification (ID) code $C_n$ is defined to be a triple
$C_n = (f_n, B_n, g_n)$, where $B_n \subset \{0,1\}^*$ is a prefix set, $f_n$ (called
an “encoder”) is a mapping from $\mathcal{X}^n$ to $B_n$, and $g_n$ (called a
“decoder”) is a mapping from $\mathcal{Y}^n \times B_n$ to $\{0,1\}$. When $C_n$
is used in the system shown in Fig. 2, its performance can be
measured by the following three quantities: the resulting average
rate defined by $r_n(C_n) = E\,n^{-1}(\text{the length of } f_n(X^n))$,
the first kind of error probability defined by
$p_{e1}(C_n) = \Pr\{g_n(Y^n, f_n(X^n)) = 0 \mid \rho_n(X^n, Y^n) \le d\}$,
and the second kind of error probability defined by
$p_{e2}(C_n) = \Pr\{g_n(Y^n, f_n(X^n)) = 1 \mid \rho_n(X^n, Y^n) > d\}$.

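Let $R \in [0,+\infty)$, $\alpha \in (0,+\infty]$ and $\beta \in (0,+\infty]$. A triple
$(R,\alpha,\beta)$ is said to be achievable if for any $\epsilon > 0$, there exists
a sequence $\{C_n\}$ of ID codes, where $C_n = (f_n, B_n, g_n)$ is an
nth-order ID code, such that for sufficiently large n,

$$r_n(C_n) \le R + \epsilon, \quad p_{e1}(C_n) \le 2^{-n(\alpha-\epsilon)} \quad \text{and} \quad p_{e2}(C_n) \le 2^{-n(\beta-\epsilon)},$$

where as a convention, $\alpha = +\infty$ ($\beta = +\infty$, resp.) means
that the first (second, resp.) kind of error probability of $C_n$
is equal to 0. Let $\mathcal{R}$ denote the set of all achievable triples.
In this paper, we are interested in determining the closure
$\bar{\mathcal{R}}$ of $\mathcal{R}$. Specifically, we define for each pair $(\alpha,\beta)$, where
$\alpha, \beta \in [0,+\infty]$,

$$R_{XY}(\alpha,\beta,d) = \inf\{R \mid (R,\alpha,\beta) \in \bar{\mathcal{R}}\}.$$

Our main problem is then the determination of the function
$R_{XY}(\alpha,\beta,d)$.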


III.  Main  Results 

Assume that X and Y are independent. For any $0 < d < E\rho(X,Y)$,
define $\beta(d)$ by $\beta(d) = \inf D(P\|P_{XY})$, where the
infimum is taken over all distributions P on $\mathcal{X}\times\mathcal{Y}$ such that
$\sum_{x,y} P(x,y)\rho(x,y) \le d$. Let U be a random variable taking
values on some finite set $\mathcal{U}$. Let $P_{XU}$ denote the joint
distribution of X and U. For any $\alpha > 0$, define

$$\mathcal{E}(P_{XU},\alpha,d) = \inf\{D(P_{\tilde{Y}}\|P_Y) + I(U;\tilde{Y})\},$$

where the infimum is taken over all random variables $\tilde{Y}$ taking
values on $\mathcal{Y}$ such that $E\rho(X,\tilde{Y}) \le d$ and
$D(P_{\tilde{Y}}\|P_Y) + I(XU;\tilde{Y}) < \beta(d) + \alpha$. Here we make use of the convention
that the infimum taken over an empty set is $+\infty$. We define
for any $\beta > 0$

$$R(P_X,P_Y,\alpha,\beta,d) = \inf\{I(X;U) \mid U \text{ is a R.V. with } \mathcal{E}(P_{XU},\alpha,d) > \beta\}$$

and let

$$R(P_X,P_Y,\alpha,0,d) = \lim_{\beta\to 0+} R(P_X,P_Y,\alpha,\beta,d).$$

The following theorem gives a general formula for $R_{XY}(\alpha,\beta,d)$.

Theorem 1 For any $0 < d < E\rho(X,Y)$, $0 < \beta < \beta(d)$, and
$\alpha \in (0,+\infty]$, the following holds

$$R_{XY}(\alpha,\beta,d) = \bar{R}(P_X,P_Y,\alpha,\beta,d),$$

where

$$\bar{R}(P_X,P_Y,\alpha,\beta,d) = \lim_{\beta'\to\beta} R(P_X,P_Y,\alpha,\beta',d).$$

The converse part of Theorem 1 is related to the general
isoperimetric problem. During the process of proving the converse
part, we develop a new powerful method for converse
proving in multi-user information theory. For more details,
please refer to [1].
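
The exponent β(d) used in Section III is a divergence minimization under a linear distortion constraint. For finite alphabets it can be evaluated by exponentially tilting the reference distribution, a standard large-deviations device. The sketch below is a hedged illustration: the function names are hypothetical, the joint alphabet is flattened into a list, and d is assumed to lie strictly between the smallest distortion on the support and the mean distortion.

```python
import math


def tilted(p, rho, s):
    """Distribution proportional to p[i] * exp(-s * (rho[i] - min rho))."""
    base = min(rho)
    weights = [pi * math.exp(-s * (ri - base)) for pi, ri in zip(p, rho)]
    total = sum(weights)
    return [w / total for w in weights]


def mean(q, rho):
    return sum(qi * ri for qi, ri in zip(q, rho))


def beta_of_d(p, rho, d, iterations=100):
    """inf D(P || p) in bits over all P with sum_i P[i] * rho[i] <= d."""
    support_min = min(r for pi, r in zip(p, rho) if pi > 0)
    if not support_min < d < mean(p, rho):
        raise ValueError("need min distortion < d < mean distortion")
    low, high = 0.0, 1.0
    while mean(tilted(p, rho, high), rho) > d:  # bracket the tilt
        high *= 2.0
    for _ in range(iterations):  # bisection on the tilt parameter
        mid = (low + high) / 2.0
        if mean(tilted(p, rho, mid), rho) > d:
            low = mid
        else:
            high = mid
    q = tilted(p, rho, high)
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Independent uniform bits X, Y with Hamming distortion, flattened
# in the order (0,0), (0,1), (1,0), (1,1).
p_xy = [0.25, 0.25, 0.25, 0.25]
rho = [0.0, 1.0, 1.0, 0.0]
print(beta_of_d(p_xy, rho, 0.25))
```

For this binary example the minimizing distribution puts probability d on the event X ≠ Y, so the value agrees with 1 - h(d), where h is the binary entropy function.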
Testing of Composite Hypotheses and ID-codes
M. V. Burnashev and S. Verdú


Abstract  -  A  geometrical  approach  to  ID- 
codes,  based  on  their  equivalence  to  some  natural 
notions  from  mathematical  statistics  is  described. 
That  not  only  enlarges  the  available  analytical  ap¬ 
paratus,  but  also  enables  us  to  strengthen  some 
known  results. 

Let A and B be finite input and output alphabets of
a stationary memoryless channel with conditional transition
probabilities $W(b|a)$, $a \in A$, $b \in B$. If P is some
probability distribution (measure) on the channel input,
then by Q we denote the generated distribution
on the channel output $B^n$.

Definition 1 [1]. A collection $\{P_i, \mathcal{D}_i, i = 1,\dots,M\}$
of probability measures $P_i$ on $A^n$ and regions $\mathcal{D}_i \subset B^n$ is
called an $(M, n, \delta)$ ID-code if the following conditions
are satisfied:

$$Q_i(\mathcal{D}_i) \ge 1 - \delta \quad \text{and} \quad Q_i(\mathcal{D}_j) \le \delta \quad \text{for any } i \ne j.$$

As concerns the maximal cardinality $M(n, \delta)$ of ID-codes,
it is known that [1,2]

$$\lim_{n\to\infty}\frac{1}{n}\log\log M(n,\delta) = C, \qquad 0 < \delta < \delta_0, \qquad (1)$$

where C is the channel capacity and $\delta_0$ is some positive constant.

Another meaning of Definition 1 is that the collection
of measures $\{P_i\}$ of an ID-code has the following property:
any simple hypothesis $P_i$ can be “tested” against
the composite alternative consisting of all remaining measures
$\{P_j, j \ne i\}$ from the same family. Or, any measure
$P_i$ is “almost orthogonal” to the convex combination of
all remaining measures.

We develop [3] in a quantitative manner this connection
between ID-codes and the testing of composite hypotheses.
Such an approach not only enlarges the research
analytical apparatus, but also enables us to strengthen
some results from [1,2]. In particular, it is shown that
the equality (1) remains valid for any $0 < \delta < 1/2$. That
gives certain completeness to (1), since for $\delta \ge 1/2$ the
number $M(n, \delta)$ becomes infinite provided that randomized
decision rules are allowed for use.
Abstract — A minimum description length criterion
for inference of functions in both parametric and
nonparametric settings is determined. By adapting
the parameter precision, a description length criterion
can take on the form $-\log(\text{likelihood}) + \text{const}\cdot m$
instead of the familiar $-\log(\text{likelihood}) + (m/2)\log n$
where  m  is  the  number  of  parameters  and  n  is  the 
sample  size.  For  certain  regular  models  the  criterion 
yields  asymptotically  optimal  rates  for  coding  redun¬ 
dancy  and  statistical  risk.  Moreover,  the  convergence 
is  adaptive  in  the  sense  that  the  rates  are  simulta¬ 
neously  minimax  optimal  in  various  parametric  and 
nonparametric  function  classes  without  prior  knowl¬ 
edge  of  which  function  class  contains  the  true  func¬ 
tion.  This  one  criterion  combines  positive  benefits  of 
information-theoretic  criteria  proposed  by  Rissanen, 
Akaike,  and  Schwarz.  It  is  also  reviewed  how  the  min¬ 
imum  description  length  principle  provides  accurate 
estimates  in  irregular  models  such  as  neural  nets. 

I.  Minimum  description  length  criterion 
Data $Y_1, Y_2, \dots, Y_n$ are assumed to be independent with an
unknown density p. Let a sequence of parametric families
be given. Each family has a density $p_k(y|\theta)$, parameter
space $\Theta_k$, and codelengths $L(k)$, $L(\theta|k)$ for the model
index k and parameters $\theta$ in a discrete subset $\tilde{\Theta}_k \subset \Theta_k$.
The codelengths are assumed to satisfy Kraft’s inequality.
Then $\min_{k,\,\theta\in\tilde{\Theta}_k}\{\log 1/p_k(Y^n|\theta) + L(\theta|k) + L(k)\}$ is the length
of a uniquely decodable code for the data, where
$p_k(Y^n|\theta) = \prod_{i=1}^{n} p_k(Y_i|\theta)$. The index $\hat{k}$ and parameter value $\hat{\theta}$ achieving
the minimum description length (MDL) provide the density
estimator $\hat{p}(y) = p_{\hat{k}}(y|\hat{\theta})$ [2,3].

The data compression quality is measured by the redundancy
of the MDL code, which is bounded by the index of resolvability
$R_n(p) = \min_{k,\theta}\{D(p\|p_{k,\theta}) + (1/n)(L(\theta|k) + L(k))\}$,
where $D(p\|q)$ denotes the Kullback-Leibler divergence [2].

The statistical accuracy of the MDL estimator of the density
is also bounded by this index of resolvability [2]. Indeed,
$E\,d^2(p,\hat{p}) \le O(R_n(p))$ where $d(p,q)$ is the Hellinger distance.

Usual choices of parameter discretization lead to penalty
terms of $(m_k/2)\log n$ where $m_k$ is the dimension of the kth
family. Then the resolvability is minimax optimal for p in any
of the parametric families, but it is suboptimal by a logarithmic
factor for p in smooth nonparametric classes.

Here the discretized parameter spaces are modified to allow
penalty terms of order $m_k$, without excessive loss in log
likelihood for smooth densities. As a consequence of the removal
of  the  logarithmic  factor,  the  redundancy  and  the  statistical 
risk  will  achieve  the  minimax  optimal  rates.  Other  modifi¬ 
cations  are  needed  for  irregular  models  such  as  neural  nets, 
which  retain  the  logarithmic  factor. 
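
As a concrete instance of the familiar two-stage criterion with penalty (m/2) log n, the sketch below selects the order of a binary Markov model, where an order-k model has m = 2^k free parameters. This is an illustrative assumption and not the function-estimation setting of this abstract. The codelength L(k) of the model index is treated as a constant and dropped, and all names are hypothetical.

```python
import math
from collections import Counter


def neg_log2_likelihood(bits, order, start):
    """Maximized -log2 likelihood of bits[start:] under a binary
    Markov model of the given order (contexts are the previous bits)."""
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(start, len(bits)):
        context = tuple(bits[i - order:i])
        context_counts[context] += 1
        transition_counts[(context, bits[i])] += 1
    return -sum(count * math.log2(count / context_counts[context])
                for (context, _), count in transition_counts.items())


def mdl_markov_order(bits, max_order):
    """Order minimizing -log2(likelihood) + (m/2) log2(n), m = 2**order."""
    n = len(bits) - max_order  # every model codes the same n symbols

    def description_length(order):
        m = 2 ** order
        fit = neg_log2_likelihood(bits, order, max_order)
        return fit + 0.5 * m * math.log2(n)

    return min(range(max_order + 1), key=description_length)


print(mdl_markov_order([0, 1] * 100, 3))
print(mdl_markov_order([0] * 200, 3))
```

Conditioning every candidate on the same initial `max_order` symbols keeps the likelihood terms comparable across orders.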

II.  Geometrically  regular  families 

We consider cases in which sequences of parametric models
provide accurate approximations with parameter values in an
ellipse $E_{r,s,m} = \{\theta \in \mathbb{R}^m : \dots\}$ with accuracy
$D(p\|p_{m,\theta}) \le c\,r^2/m^{2s}$, where r, s are unknown. The models
are chosen such that $D(p_{m,\theta}\|p_{m,\theta'})$ is bounded by a constant
times the squared Euclidean distance between parameters $\theta$
and $\theta'$. These conditions hold for instance when the logarithm
of the density on an interval is parameterized using a
polynomial or trigonometric expansion of degree m and the
true log-density has a bound on the norm of its sth derivative.
(The conditions also hold in a regression setting with
Gaussian errors and smooth regression functions modeled using
polynomial or trigonometric series.)

The discretized parameter space $\tilde{\Theta}_m$ is taken to be the
union for all positive integers $r, s, \ell$ of the ellipses $E_{r,s,m}$
intersected with a cubical grid $G_{\ell,m}$ spaced at width $1/\ell$ in
each coordinate. An evaluation of the cardinality of the ellipse
restricted to the grid shows that we may set
$L(\theta|m) = m\log(Jcr) + O(\log(rs\ell))$, where $J = \dots$. This codelength
is of order m for bounded J whereas it is of order
$(m/2)\log n$ when J is of order $\sqrt{n}$. The corresponding MDL
criterion leads to estimates $\hat{m}, \hat{r}, \hat{s}, \hat{\ell}, \hat{\theta}$, and $\hat{p} = p_{\hat{m},\hat{\theta}}$.
Plugging the approximation and codelength bounds into the
resolvability leads to a rate which is minimax
optimal for the redundancy and for the statistical risk. The
optimal rate is achieved adaptively, that is, in the absence of
knowledge of the index of the smoothness class.

III.  Neural  nets 

Analogous treatment for functions of d variables with the
usual expansions leads to an exponentially large parameter
dimension $m_k = k^d$, which is minimax optimal yet requires
exponentially large sample sizes to obtain accurate estimates. For
practical inference, it is necessary to consider more restrictive
function classes and more parsimonious models.

One useful condition is that the spectral norm
$C_f = \int |\omega|\,|\tilde{f}(\omega)|\,d\omega$ have not too large a value, where $\tilde{f}$ denotes the
Fourier transform of the target function f. Sparse trigonometric
or sigmoidal expansions $f_m(x) = \sum_{i=1}^{m} c_i\,\phi(a_i \cdot x + b_i)$
with a fixed sinusoidal or sigmoidal function $\phi$ (nonlinearly
parameterized by $a_i$ and $b_i$) provide an approximation error of
$\|f - f_m\|^2 \le C_f^2/m$ and a complexity per sample size of order
$md(\log n)/n$, yielding a resolvability of order $C_f(d(\log n)/n)^{1/2}$
as shown in [1]. Here the number and choice of terms is
selected using a description length criterion with penalty of
(# param.) log n times a constant. The resulting resolvability
exhibits more favorable behaviour in high dimensions than is
possible with linear models.
SOME ESTIMATION PROBLEMS IN INFINITE
DIMENSIONAL GAUSSIAN WHITE NOISE

I. Ibragimov, St. Petersburg Branch of the Mathematical Institute RAN,
R. Khasminskii, Dept. of Mathematics, Wayne State University, Detroit, MI
48202, USA.

Abstract — Methods of information theory and approximation theory
are used to obtain the conditions for the existence of consistent estimators for
the observations in a Gaussian white noise in a Hilbert space.


0.1 Statement of problem

Let H be a Hilbert space and Q a symmetric positive operator in H. Let
$w_Q(t)$ be a Q-Wiener process in the terminology of [1]. Let $L_2(0,1) = L_2$ be
the Hilbert space of H-valued functions s with the inner product and norm

$$(s_1, s_2) = \int_0^1 (s_1(t), s_2(t))_H\,dt; \qquad \|s\|^2 = (s, s).$$

We assume that the process $X_\varepsilon(t)$, $0 \le t \le 1$, is observed, and

$$dX_\varepsilon(t) = s(t)\,dt + \varepsilon\,dw_Q(t). \qquad (1)$$

It is known a priori that the “signal” s runs over a known set $\Sigma \subset L_2$ and the
intensity $\varepsilon$ and correlation operator Q of the “noise” $dw_Q(t)$ are known to the
statistician. The problem is to estimate the value $\Phi(s)$ of a known function
$\Phi : L_2 \to U$ (U is a Euclidean or Hilbert space). The estimation of s and
the estimation of a finite dimensional parameter in s can be embedded in this
general scheme.

0.2  LAN  property 

Let $P_s^{(\varepsilon)}$ be the probability distributions associated with $X_\varepsilon$, and $P_0^{(\varepsilon)}$ be the
distribution of $\varepsilon w_Q(t)$. It is well known [2] that for $\Sigma \subset L_2$ the measures
$P_s^{(\varepsilon)}$ are mutually absolutely continuous and

$$\frac{dP_{s+\varepsilon h}^{(\varepsilon)}}{dP_s^{(\varepsilon)}} = \exp\left\{\int_0^1 (Q^{-1/2}h, dw_s(t)) - \frac{1}{2}\|h\|^2\right\}.$$

It follows that the family of measures $\{P_s^{(\varepsilon)}, s \in \Sigma\}$ satisfies the LAN condition
in the sense of [3].

This fact implies the minimax lower bound of the estimation risks for $\Phi(s)$
and allows one to investigate the concept of (asymptotically) efficient estimation
for this model. Some natural examples are considered.

0.3  The  existence  of  consistent  estimators 

If neither $\Phi'(s)$ nor $\dots$ are Hilbert-Schmidt operators it is impossible to
guarantee the existence even of consistent estimators for $\Phi(s)$. Nevertheless,
methods of information theory and the theory of approximation allow one to
propose some necessary and sufficient conditions for the existence of estimators
consistent in some metric and to find the rate of convergence of risks to zero
when $\varepsilon \to 0$. For example, let $\Sigma$ be a bounded set in $L_2$. Let $\Phi : L_2 \to B$
be a linear operator, where B is a Banach space. Then uniformly consistent
estimators of s exist iff $\dots$ is a compact operator.

Our approach generalizes the results of [4].
Abstract — Local polynomial fitting for the estimation
of a general regression function and its derivatives
for $\rho$-mixing and strongly mixing processes is considered.
Joint asymptotic normality for the regression
function and its derivatives is established.

I. Introduction

Local polynomial fitting has been studied in recent years under
the assumption of i.i.d. observations and has been shown
to possess very useful statistical properties in the context of
curve estimation. This paper considers a time series setting
and treats the following regression estimation problem. Let
$\{X_i, Y_i\}$ be a stationary process and let $\psi$ be a measurable
function on the real line. Assume that $E|\psi(Y_i)| < \infty$ and
define the regression function

$$m(x) = E[\psi(Y_i) \mid X_i = x].$$

Estimates of m(x) and its first p derivatives, via a local polynomial
fit, are considered. Special cases include the estimation
of conditional distributions and densities, $\psi(Y) = I\{Y \le y\}$,
conditional moments, $\psi(Y) = Y^k$, and d-step prediction in
time series, $Y_i = X_{i+d}$. The joint asymptotic normality of
$\hat{m}(x)$ and its associated first p derivatives is established for
mixing processes $\{X_i, Y_i\}$.

II.  Formulation 

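If the (p+1)th derivative of m(z) at the point x exists, we
approximate m(z) locally by a polynomial of order p:

$$m(z) \approx m(x) + \dots + m^{(p)}(x)(z-x)^p/p! \equiv \beta_0 + \dots + \beta_p (z-x)^p. \qquad (1)$$

One then carries out a local polynomial regression by minimizing

$$\sum_{i=1}^{n}\Big(\psi(Y_i) - \sum_{j=0}^{p}\beta_j (X_i - x)^j\Big)^2 K\Big(\frac{X_i - x}{h}\Big), \qquad (2)$$

where K(·) denotes a nonnegative weight function and h,
a smoothing parameter, determines the size of the neighborhood
of x. If $\hat{\beta} = (\hat{\beta}_0, \dots, \hat{\beta}_p)$ denotes the solution to the
above weighted least squares problem, then by (1), $j!\,\hat{\beta}_j(x)$
estimates $m^{(j)}(x)$, $j = 0, \dots, p$. Minimizing (2) leads to the
following set of equations: Let $K_h(x) = K(x/h)/h$ and let

$$s_{n,j} = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{X_i - x}{h}\Big)^j K_h(X_i - x), \qquad (3)$$

$$t_{n,j} = \frac{1}{n}\sum_{i=1}^{n}\psi(Y_i)\Big(\frac{X_i - x}{h}\Big)^j K_h(X_i - x). \qquad (4)$$

Putting

$$S_n = \begin{pmatrix} s_{n,0} & \cdots & s_{n,p} \\ \vdots & & \vdots \\ s_{n,p} & \cdots & s_{n,2p} \end{pmatrix}, \qquad t_n = \begin{pmatrix} t_{n,0} \\ \vdots \\ t_{n,p} \end{pmatrix}, \qquad (5)$$

the solution to (2) can be expressed as

$$\hat{\beta}(x) = \mathrm{diag}(1, h^{-1}, \dots, h^{-p})\,S_n^{-1} t_n. \qquad (6)$$

III. Results

Denote

$$\mu_j = \int u^j K(u)\,du, \qquad \nu_j = \int u^j K^2(u)\,du, \qquad (7)$$

and

$$S = \begin{pmatrix} \mu_0 & \cdots & \mu_p \\ \vdots & & \vdots \\ \mu_p & \cdots & \mu_{2p} \end{pmatrix}, \quad \tilde{S} = \begin{pmatrix} \nu_0 & \cdots & \nu_p \\ \vdots & & \vdots \\ \nu_p & \cdots & \nu_{2p} \end{pmatrix}, \quad U = \begin{pmatrix} \mu_{p+1} \\ \vdots \\ \mu_{2p+1} \end{pmatrix}. \qquad (8)$$

We only state here one result along with the conditions on
the mixing coefficients. See [1] for the complete analysis and
results.¹

Condition 1. Assume that $h_n \to 0$ and $(nh_n)/\log^2(n) \to \infty$
and put $s_n = (nh_n)^{1/2}/\log n$. For $\rho$-mixing and strongly
mixing processes, we assume that

$$(n/h_n)^{1/2}\rho(s_n) \to 0 \quad \text{and} \quad (n/h_n)^{1/2}\alpha(s_n) \to 0, \quad \text{as } n \to \infty.$$

Theorem. Under Condition 1, if $h_n = O(n^{-1/(2p+3)})$
then, as $n \to \infty$,

$$\sqrt{nh_n}\Big[\mathrm{diag}(1, \dots, h_n^p)\big(\hat{\beta}(x) - \beta(x)\big) - \frac{h_n^{p+1} m^{(p+1)}(x)}{(p+1)!}\,S^{-1}U\Big] \xrightarrow{\mathcal{L}} N\big(0, \sigma^2(x)\,S^{-1}\tilde{S}S^{-1}/f(x)\big),$$

where $\sigma^2(x) = \mathrm{var}(\psi(Y) \mid X = x)$, at continuity points of $\sigma^2 f$
whenever f(x) > 0.

Remark. The theorem gives the joint asymptotic normality
for the estimators $\{\hat{m}^{(j)}(x) = j!\,\hat{\beta}_j(x)\}_{j=0}^{p}$. The asymptotic
normality, “bias”, and “variance” of the individual components
follow immediately from the theorem.

¹This work was supported by the Office of Naval Research under
Grant N00014-90-J-n75.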
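
For intuition, the weighted least squares step of local polynomial fitting can be sketched for the local linear case p = 1, where the normal equations reduce to a 2-by-2 system that is solved in closed form. The Epanechnikov weight function, the function name, and the synthetic data are assumptions made for this illustration and are not taken from [1].

```python
def local_linear(xs, ys, x, h):
    """Return (b0, b1) minimizing
    sum_i (y_i - b0 - b1 * (x_i - x))**2 * K((x_i - x) / h).

    b0 estimates the regression function at x and b1 its first derivative.
    """
    def kernel(u):  # Epanechnikov weight, an illustrative choice of K
        return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(xs, ys):
        u = xi - x
        w = kernel(u / h)
        s0 += w
        s1 += w * u
        s2 += w * u * u
        t0 += w * yi
        t1 += w * u * yi
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("too few points in the window; increase h")
    # Solve [[s0, s1], [s1, s2]] @ [b0, b1] = [t0, t1] by Cramer's rule.
    b0 = (s2 * t0 - s1 * t1) / det
    b1 = (s0 * t1 - s1 * t0) / det
    return b0, b1


xs = [i / 50 for i in range(51)]
ys = [2.0 + 3.0 * x for x in xs]  # noiseless line, reproduced exactly
print(local_linear(xs, ys, 0.5, 0.2))
```

Because a local linear fit reproduces straight lines exactly, the noiseless example gives a quick sanity check before applying the same routine to dependent data.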
Let $\{X_i\}$ be a sequence of i.i.d. real valued random
variables with common unknown density f. We
denote by $\mu$ the measure with density f. We consider
the histogram estimate $f_n$ of f built from a partition
$\mathcal{P}_n = \{A_{n,i}\}$ with interval size $h_n > 0$, that is
$f_n(x) = \mu_n(A_n(x))/h_n$, where $A_n(x) = A_{n,i}$ if $x \in A_{n,i}$
and $\mu_n$ is the empirical measure. Introduce the following
notation:

Via)^Var  (^|Ar| |[(1  - 

where $a > 0$ and N is a standard normal $\mathcal{N}(0,1)$ random
variable.

Theorem 1 ([£]): If f is continuously differentiable and
if $h_n = cn^{-1/3}$ then

$$\sqrt{n}\,(\|f_n - f\| - E\|f_n - f\|)/\sigma \xrightarrow{\mathcal{D}} \mathcal{N}(0,1),$$
where  =  fV  ^  °  ^  ^  /. 

One  can  show  that  <  1  —  This  should  be  compeired 
to the rate of convergence of $E\|f_n - f\|$, which is at least
of order $n^{-1/3}$ for differentiable f, and it can be achieved
for $h_n = cn^{-1/3}$.

We consider the problem of estimating an unknown
probability density function in information divergence.
If $\mu$ and $\nu$ are probability measures on the real line, absolutely
continuous with respect to a $\sigma$-finite measure $\lambda$
with densities f and g respectively, then the information
divergence between $\mu$ and $\nu$ is defined by

$$I(\mu,\nu) = \int_{R} f(x)\log\frac{f(x)}{g(x)}\,\lambda(dx) = D(f,g).$$

Barron, Györfi and van der Meulen (1992) showed that
if there exists a known density g such that $D(f, g) < \infty$,
then one can construct a density estimator as follows:
define a sequence of integers $m_n$, and put $h_n = 1/m_n$.
Let $\nu$ denote the probability measure with density g.
Introduce partitions $\mathcal{P}_n = \{A_{n,1}, \dots, A_{n,m_n}\}$, $n = 1, 2, \dots$,
of the real line such that the $A_{n,i}$'s are intervals
with $\nu(A_{n,i}) = h_n$. For a given sequence $a_n = 1/(nh_n + 1)$
consider the following density estimate:

$$f_n(x) = \big((1 - a_n)\,\mu_n(A_n(x))/h_n + a_n\big)\,g(x).$$

If in addition $\lim_{n\to\infty} h_n = 0$ and $\lim_{n\to\infty} nh_n = \infty$, then

$$\lim_{n\to\infty} D(f, f_n) = 0 \quad \text{a.s.}$$
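
The estimate above can be sketched in a few lines under the simplifying assumption that the known density g is uniform on [0, 1], so that the cells with ν-measure h_n are simply m_n equal subintervals. The function name and the sample values are hypothetical. The mixing weight a_n keeps every cell height strictly positive, which is what keeps the divergence from blowing up on empty cells.

```python
def barron_histogram(sample, m):
    """Cell heights of f_n on [0, 1] when g is the uniform density.

    With g uniform, the cells of nu-measure h = 1/m are the intervals
    [j/m, (j+1)/m), and f_n is constant on each of them.
    """
    n = len(sample)
    h = 1.0 / m
    a = 1.0 / (n * h + 1.0)
    counts = [0] * m
    for x in sample:
        counts[min(int(x * m), m - 1)] += 1  # x = 1.0 goes to the last cell
    # (1 - a) * mu_n(cell) / h + a, multiplied by g(x) = 1
    return [(1.0 - a) * (c / n) / h + a for c in counts]


heights = barron_histogram([0.1, 0.15, 0.2, 0.7], 2)
print(heights, sum(heights) / len(heights))
```

The second printed value is the integral of the estimate over [0, 1], which should equal 1 up to rounding.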

Theorem  2  (fSf):  Let  he  the  support  set  of  ft.  Under 
the  conditions  of  consistency 

n^/^[D{f,  fn)  -  E{D{f,  /„))]  ^  Af(0,  a% 

where  cr^  =  v{S)i)  >  0. 

For the choice $m_n = n^{1/3}$ the rate of convergence of the
random  part  of  the  divergence  error  is  of  order 
and  under  some  restrictive  conditions  on  the  unknown 
density  / 

$$E(D(f, f_n)) \le O(n^{-2/3}).$$
Abstract — We discuss optimal bandwidth
choice and optimal convergence rates for density
estimation with dependent data, as the amount of
information in the sample is altered by adjusting
the range of dependence.

I.  Introduction 

Assume that data are observed from a stationary stochastic
process that may be taken to be an unknown function
of a Gaussian process. Thus, the strength of dependence
is determined entirely by a single sequence of numbers,
the covariances $\gamma(i)$; this makes it relatively straightforward
to appreciate the influence of different strengths of
dependence on various aspects of bandwidth choice, even
up to terms of second or third order. Let us assume here,
for the sake of simplicity, that $\gamma(i) \sim c\,i^{-\alpha}$ for constants
$c \ne 0$ and $\alpha > 0$. Then smaller values of $\alpha$ correspond
to less information in a data sequence of given length $n$
from the process, and hence to slower convergence rates.
Surprisingly, the traditional dichotomy of short-range versus
long-range dependence, or equivalently $\alpha > 1$ versus
$\alpha < 1$, does not have a major role to play in the bandwidth
choice problem. We shall discuss the effect that
the value of $\alpha$ has on optimal bandwidth choice and,
correspondingly, on convergence rates.

II.  Outline  of  Main  Results 

In the case of density estimation based on a second-order
kernel, the “barrier” normally encountered at $\alpha = 1$ occurs
instead at $\alpha = 4/5$. When $\alpha > 4/5$ the minimum
mean integrated squared error (MISE) is asymptotic to
a constant multiple of $n^{-4/5}$, which is identical to that
value which it enjoys in the case of independence (effectively,
$\alpha = \infty$). Indeed, even the value of the constant is
identical to that which it would be for independent data.
Furthermore, for such $\alpha$'s the deterministic bandwidth
that minimizes MISE agrees even to second order with
its counterpart in the case of independent data; under
short-range dependence, where $\alpha > 1$, the agreement is
up to (but not including) third order.

When $\alpha < 4/5$, minimum MISE is of size $n^{-\alpha}$, but
curiously, provided that $2/5 < \alpha < \infty$, the bandwidth
that produces the overall minimum still agrees to first
order with its counterpart in the case of independent
data. Thus, very-long-range dependence is allowable before
much change has to be made to the optimal bandwidth
formula. Only when $\alpha < 2/5$, which is a context
of particularly long-range dependence, is there a large
difference between the first-order properties of the
MISE-optimal bandwidth under dependence, and its counterpart
for independent data.

What is more, even when $\alpha < 2/5$ the bandwidth appropriate
for independent data produces first-order minimization
of MISE. This is a consequence of the fact that,
whenever $\alpha < 4/5$, adjusting the bandwidth in the vicinity
of the optimum has an effect only on second- and
higher-order terms; to first order, MISE does not depend
on bandwidth. This result is rather striking to researchers
who are familiar only with the case of independent data,
where first-order adjustments to bandwidth always affect
first-order features of performance.

More generally, if the kernel is of order $r \ge 2$ then the
“boundaries” at 2/5 and 4/5 change to $r/(2r+1)$ and
$2r/(2r+1)$, respectively.
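As a small worked illustration of these boundaries, the sketch below evaluates r/(2r+1) and 2r/(2r+1) exactly and labels the regime for a given decay exponent. The function names and the regime labels are hypothetical, and the treatment of values lying exactly on a boundary is an arbitrary assumption, since the text only discusses strict inequalities.

```python
from fractions import Fraction


def dependence_boundaries(r):
    """Return (r/(2r+1), 2r/(2r+1)) for a kernel of order r."""
    if r < 2:
        raise ValueError("kernel order r is assumed to be at least 2")
    return Fraction(r, 2 * r + 1), Fraction(2 * r, 2 * r + 1)


def regime(alpha, r=2):
    """Label the regime for covariance decay exponent alpha.

    Boundary values are assigned to the lower regime (an assumption).
    """
    lower, upper = dependence_boundaries(r)
    if alpha > upper:
        return "MISE rate as for independent data"
    if alpha > lower:
        return "MISE of size n^(-alpha); bandwidth agrees to first order"
    return "first-order optimal bandwidth differs from the independent case"


second_order = dependence_boundaries(2)   # (2/5, 4/5)
fourth_order = dependence_boundaries(4)   # (4/9, 8/9)
```

For a second-order kernel this reproduces the boundaries 2/5 and 4/5 quoted above; higher-order kernels push both boundaries up, towards 1/2 and 1.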

These  results  indicate  that  those  practical  bandwidth- 
choice  rules  that  have  been  proposed  for  independent  data 
and  are  based  on  plug-in  rules,  have  straightforward  gen¬ 
eralizations  to  certain  types  of  dependent  data,  even  un¬ 
der  very  long-range  dependence.  Generally  speaking  this 
is  true,  although  there  are  some  qualifications. 
Abstract - Large deviations estimates yield a
convenient tool to study asymptotics of Gibbs fields.
Applications to parametric estimation and detection
of phase transition are given.

I.  INTRODUCTION 

Gibbs Random Fields (GRF) provide pertinent
statistical models for spatial data indexed by $i \in Z^d$, where
important features of the dependence structure can be
captured in a very natural way. An important issue is
image analysis via D. & S. Geman's Bayesian approach.

II. PARAMETRIC FAMILIES OF GRF

We are given for each $\theta \in \Theta \subset R^p$ a compatible family,
indexed by finite $\Lambda \subset Z^d$, of conditional distributions $\Pi_{\theta,\Lambda}$
of $X_\Lambda = (X_i)_{i \in \Lambda}$ given $X_{\Lambda^c}$; these $\Pi_{\theta,\Lambda}$ are related by a
natural translation invariance property. A distribution P
having these specified conditional distributions is
called a GRF. First order phase transition occurs when
the set $G(\theta)$ of such GRF contains more than one element;
this situation is characterized by the fact that the set
$G_s(\theta)$ of stationary elements of $G(\theta)$ does not reduce to a
singleton. Note that this is not an effect of ill
parametrization, but an intrinsic phenomenon. Then some
GRF are not ergodic and, even worse, there may exist
GRF which are not translation invariant. Statisticians
facing real data should not assume invariance in general.

III. LARGE DEVIATIONS

The empirical field based on a configuration
$x = (x_i)_{i \in Z^d}$ and on a cubic box $\Lambda$ is

$$R_{\Lambda,x} = \frac{1}{|\Lambda|} \sum_{i \in \Lambda} \delta_{\tau_i x}$$

with $\tau_i$ the shift operator. When it holds, the ergodic
theorem states that the empirical field is a good guess for
the actual GRF. In general we will use the following as a
substitute: large deviations estimates hold for
all $P \in G(\theta)$, and they mean heuristically

$$P\{R_{\Lambda,x} \text{ close to } Q\} \approx \exp\bigl(-|\Lambda|\, I_\theta(Q)\bigr) \qquad (1)$$

for some non-negative entropy functional $I_\theta$ defined on
the set of random fields. Moreover

$$I_\theta(Q) = 0 \iff Q \in G_s(\theta). \qquad (2)$$

The relations (1) and (2) simply state that the data will
not behave worse than the worst stationary GRF.

IV. CONSISTENCY CRITERIA FOR PARAMETRIC
ESTIMATORS

Let $\Lambda$ be the window of observation, and $\hat\theta$ be any
maximizer of some objective function $\theta \mapsto k_\Lambda(\theta; x_\Lambda)$.
Assume that there exists a real continuous function
$K(\theta; Q)$ with

i) $k_\Lambda(\theta; x_\Lambda) = K(\theta; R_{\Lambda,x}) + \varepsilon_\Lambda$ and $\lim_{\Lambda} \sup_{x,\,\theta \in \Theta} \varepsilon_\Lambda = 0$

ii) $\theta$ is the unique maximizer of $K(\cdot\,; P)$, for $\theta \in \Theta$, $P \in G_s(\theta)$

iii) $\Theta$ is compact.

Then $\hat\theta$ is a.s. consistent.

The previous criterion applies to classical estimators
in a general setup. The question of asymptotic optimality
also suffers from the breakdown in the central limit
theorem, related to phase transition: it can be treated via
large deviations using Bahadur's approach. Finally, prior
to the use of Gaussian statistics and tests, one would like
to know from the data themselves whether phase transition
holds or not.

V. DETECTING PHASE TRANSITION

For cubic boxes $\Lambda$ we now choose smaller cubic boxes
$\Lambda'$ such that, as $\Lambda \uparrow Z^d$,

$$|\Lambda'|^{-1} \log|\Lambda| \to 0, \qquad |\Lambda'|^{-1} (\log|\Lambda|)^{d/(d-1)} \to \infty. \qquad (3)$$

Define the set $A_\Lambda(x)$ of moving empirical fields $R_{i+\Lambda',x}$
based on all the translates $i + \Lambda'$ of $\Lambda'$ which are
included in the window of observation $\Lambda$. Generalized
Erdős-Rényi laws state that for $P \in G(\theta)$, under the
condition (3), it holds P-a.s. that

$$\lim_{\Lambda \uparrow Z^d} A_\Lambda(x) = G_s(\theta) \qquad (4)$$

in the sense of Hausdorff convergence of closed sets.

The result (4) shows how to estimate consistently the
set of all stationary GRF with the same parameter $\theta$ as
the underlying one. Then one can asymptotically detect
phase transition from a single sample. More practical
versions of (4) may be given, and studied via simulation
experiments.
Abstract — Large deviation theory is used to obtain
the rate distortion theorem for Gibbs distributions
together with exponentially small error probabilities.

Let $\sigma$ be a Gibbs distribution on $\Omega_0^{Z^2}$ with finite range
interaction and $\Omega_0$ finite. For simplicity we assume that $\sigma$ is
unique. For any domain $G \subset Z^2$ containing the origin, define
the empirical distribution $R_{\Lambda,G}(\omega)$, where $G \subset\subset \Lambda \subset Z^2$.
Large deviation theorems [1] provide asymptotically exponential upper
and lower bounds on the probability that the empirical distribution
$R_{\Lambda,G}(\omega)$ under $\sigma$ deviates in variational norm from
the marginal $\sigma_G$ of $\sigma$ on $G$, as $\Lambda$ tends to infinity. In particular
these hold if $\sigma$ is a product measure. Using these theorems
many of the standard asymptotic results of errorless coding
theory can be neatly formulated and extended to Gibbs random
fields, see [2].

Here  we  present  the  application  of  these  theorems  to  coding 
with  distortion,  more  or  less  following  the  proof  in  [3].  Let 
Z(-,  •)  be  some  distance  on  Qo  and  define 

A(wa,Wa)  = 

Given $\delta > 0$ and $\Delta > 0$, what size codebook is needed so that
with very high probability a random sample $\omega_\Lambda$ from $\sigma$ will
find a code word $\omega'_\Lambda$ such that $\Delta(\omega_\Lambda, \omega'_\Lambda) < \Delta + \delta$?

Let $\Lambda_n$ be an increasing sequence of $n \times n$ domains. Fix
$l$ and set $k_n = n/l$. Henceforth the $n$ subscript is omitted
for notational ease. Let $G_{ij}$, $i, j = 0, \ldots, k-1$, be the $k^2$
non-overlapping $l \times l$ domains in $\Lambda$. Set $G = G_{00}$. Let $Q(\omega'_G \mid \omega_G)$
be a conditional probability distribution, and
$Q(\omega_G, \omega'_G) = \sigma_G(\omega_G)\,Q(\omega'_G \mid \omega_G)$ be the joint probability
on $\Omega^G \times \Omega^G$. The marginal on the second coordinate is
denoted $Q_2(\omega'_G)$. Define


Ra{Q) 


El  A(wg,  a'G)Q(a'G, 


El  Q(t*^G,u;G)log 


Q{u'g\u>g) 


Thus $\Delta_Q$ is the expected distortion under the joint distribution,
and $R_G(Q)$ is the average mutual information of the two
coordinates.

For  any  two  distributions  ctg, t^g  on  fi®  let  \ffG  —  T^a\  = 
maxj^G  |(TG(<^G)'~7rG(‘*’G)|.  Let  RA,G{a^)  be  the  block  empirical 

distribution  on  fi®,  considering  only  disjoint  blocks.  Apply¬ 
ing  the  large  deviation  theorems  to  block  Gibbs  distributions 
where  each  disjoint  G  block  is  aggregated  as  one  site  we  get 
that  outside  a  set  He, a  of  exponentially  small  probability  in 
,  the  frequency  of  occurence  ^(wg)  of  a  specific  configura¬ 
tion  uiG,  without  overlaps,  in  wa  is  within  e  of  its  underlying 
probability,  (Tg(wg)- 

^This  work  was  supported  by  Grant  ARO  DAAL03-92-G-0322. 


For  MG  €  0?  let  c(mg)  =  Ea-^  wg)/Aq. 

For  each  domain  Gij  choose  independently  fiom  the  dis¬ 
tribution  Q2,  to  obtain  another  configuration  from  the 
product  distribution  Q  =  ®ij=iQ2{')  on  Let  wa  be  a  con¬ 
figuration  in  Using  the  fact  that  \n(uG)/k'^  -  aaiua)]  < 

e,  the  probability  that  A(aJA,t^A)  <  Aq  -f-  ^  is  bounded  below 
by 

n(uG)>0  '  “Gij=“G 

where  /^(mg)  =  c(ug)(Aq  -f-  5l2)l{<7a(ua)  —  e)- 

Using  large  deviation  results  for  i.i.d  distributions,  given 
arbitrary  7  >  0,  for  sufficiently  large  k,  each  term  in  the  above 
product  is  bounded  below  by  exp[— 1:^(<tg(mg)  +  e)(.7(«G)  + 
7)],  where  J{ug)  =  infF(uo)  Q2)  is  the  infimum  ofRKL 

divergences  with  respect  to  Q2  over  the  set  of  measures 

•  r/  \  /  /it  '1  /  ,  /  w  c(mg)(Aq  +  5/2) . 

jP(mg)  =  {ttg;  /  A(UG,‘^a)T^a(dwG)  <  — _  e  ^ 

From  the  choice  of  c{ug')  it  follows  that  the  conditional  distri¬ 
bution  Q(’|mg)  €  F(ug)-  Aggregating  this  over  all  mg’s  found 
in  WA  we  have 

lim E log  S^A(a7A!‘i’A)  ^  •^<3  +  ^  Rg{Q)  —  7  > 

with  7'  — ►  0  as  e  — ♦  0. 

Taking  I  =  exp  [m^(Ag((?)  +  7' +  7")]  with  7"  >  0,  choose 
L  independent  samples  v^iO  =  from  Q.  Using  the 

lower  bound  above,  it  is  easily  shown  that  with  probability  ex¬ 
ponentially  close  to  1,  every  element  of  a  1®  within  distance 
less  than  Aq  -|-  ^  from  at  least  one  of  the  in  the  random 
sample,  and  7”  — ►  0  as  e  — >  0. 

Theorem:  There  exist  constants  c,d,a,ld  >  0  such  that 
with  probability  1  —  ce“"  “  a  random  choice  of  L  independent 
samples  from  the  distribution  Q  on  Q*  will  provide  a  codebook 
of  rate  (Ag(Q)  +  7'+7")  per  pixel,  for  which  any  configuration 
WA  €  Ai  has  a  codeword  va  such  that  A(wa,  ma)  <  Aq  -f  i5, 
and  (T{Be,A)  <  de~’'^^.  Moreover  7',  7"  — >•  0  as  e  — <■  0  so 
that  it  is  asymptotically  possible  to  code  samples  from  a  with 
minimal  rate 

Ag(A)  =  inf{iiG(Q);  Aq  <  A}  • 

Observe  that  this  rate  distortion  curve  depends  on  the  base 
domain  G.  The  larger  G  the  lower  the  rates  will  be  for  fixed 
distortion. 
Abstract — We present a survey of design problems
and results that arise in the prediction and parameter
estimation of stochastic partial differential equations.
The aim is to better understand some unavoidable
errors that occur in the discretization of SPDEs, and
available methods for minimizing these errors.

I. Introduction

Solutions to many prediction and estimation problems associated
with continuous Gaussian Markov fields satisfy minimum
principles and may be characterized as solutions of stochastic
boundary value or initial value problems, see e.g. [1], [2], [3],
and [4]. These characterizations provide a theoretical basis
for the calculation, but in the implementation of these calculations
numerous issues arise. A typical problem may involve
a smooth elliptic boundary problem on a smooth domain, with
boundary data that must be empirically determined, but typically
this data will be generalized functions and cannot be
interpreted as classical functions. A careful analysis is required
to determine the relative merits and limitations of different
discretizations of such a problem. This paper presents examples
which illustrate these issues, and where the required
analysis has, at least in part, been completed.

II.  A  General  Design  Problem  with  an 
Illustrative  Example 

Consider a random field $\{\phi(t,x) : t \in R, x \in R^d\}$ that cannot
be observed on a restricted set $D \subset R \times R^d$. It is desired to
observe $\phi$ off the set $D$, and, based on these observations, to
calculate the conditional expectation $\phi_D(t,x)$ for $(t,x) \in D$.
Of course, in fact, $\phi$ can only be observed on a finite set of $N$
times and places $(t_j, x_j) \notin D$, and $N$ may be very limited. The
following questions arise in considering the merits of designs
and computational recipes. If $N$ is fixed, what is the smallest
possible prediction error $e^2(x,N)$, and what is the limiting
value $e^2(x,\infty)$? What are the asymptotics of

$$e^2(x,\infty) - e^2(x,N)$$

and of

$$\frac{e^2(x,\infty) - e^2(x,N)}{e^2(x,N)}\,?$$

Where should the $N$ sites $\{(t_j, x_j)\}$ be located to approximately
achieve the minimum error $e^2(x,N)$?

The simplest illustrative special case for these problems
occurs in the time independent $d = 2$ case when $\phi$ satisfies
the elliptic SPDE

$$\phi(x) - \Delta\phi(x) = \dot{w}(x),$$

with $\dot{w}(x)$ a Gaussian white noise in the plane. In this case,
when $D \subset R^2$ is a bounded domain with smooth boundary $\Gamma$,
typical results are

'^Supported  ONR  Contract  No.  N00014-90-J-1639. 


A.

$$(I - \Delta)^2 \phi_D(x) = 0$$

for all $x \in D$, and satisfies the boundary conditions
$\phi_D(x) = \phi(x)$ and $\partial_n \phi_D(x) = \partial_n \phi(x)$ on $\Gamma$.

B.

$$e^2(x,\infty) = G_D(x,x),$$

where $G_D(x,y)$ is the Green's function for $(I - \Delta)^2$ on the
domain $D$.

A careful analysis of the errors made in discretizing the
Poisson integral representation of $\phi_D$ in A yields, see [4], [5],

C.

$$e^2(x,\infty) - e^2(x,N) = 1/N,$$

together with precise numerical constants and asymptotically
optimal locations for the sites $\{x_j\}$.
Abstract. We briefly describe Markov chain Monte
Carlo algorithms, such as the Gibbs Sampler and
the Metropolis-Hastings Algorithm, which are frequently
used in the statistics literature to explore
complicated probability distributions. We present a
general method for proving rigorous, a priori bounds
on the number of iterations required to achieve
convergence of the algorithms.

I.  Introduction. 

Markov chain Monte Carlo techniques have become
very popular in recent years as a way of generating
a sample from complicated probability distributions
(such as posterior distributions in Bayesian
inference problems). The idea of such algorithms is
to define a Markov chain which has as its stationary
distribution the distribution $\pi(\cdot)$ of interest.

Procedures for defining the Markov chain include
the Metropolis-Hastings algorithm (Metropolis
et al., 1953; Hastings, 1970), whereby the Markov
chain proceeds by “proposing” a new point according
to some scheme, and then “accepting” that point
with a certain probability, chosen to make the Markov
chain reversible with respect to $\pi(\cdot)$; and the Gibbs
sampler (Geman and Geman, 1984; Gelfand and
Smith, 1990), whereby the Markov chain proceeds
by updating the various coordinates of the point in
turn according to the correct conditional distribution
as indicated by $\pi(\cdot)$.
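The propose/accept mechanism just described can be sketched as follows. This is a minimal random-walk Metropolis illustration, not the specific chains analysed in the paper; the Gaussian proposal, the standard normal target, the step size and all names are assumptions made for the example.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis: symmetric Gaussian proposals around the current point."""
    rng = random.Random(seed)
    x = x0
    samples = []
    accepted = 0
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        # For a symmetric proposal the acceptance probability is
        # min(1, pi(proposal) / pi(x)), which makes the chain reversible
        # with respect to pi.
        log_ratio = log_target(proposal) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps


def log_standard_normal(x):
    # Unnormalised log density of the assumed target pi = N(0, 1).
    return -0.5 * x * x


samples, acceptance_rate = metropolis_hastings(log_standard_normal, x0=3.0, n_steps=20000)
sample_mean = sum(samples) / len(samples)
```

Only ratios of the target density enter the acceptance step, so the normalising constant of the target is never needed; this is what makes the method usable for posterior distributions known only up to a constant.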

A  fundamental  issue  regarding  such  techniques 
is  their  convergence  properties,  specifically  whether 
or  not  the  algorithm  will  converge  to  the  correct  dis¬ 
tribution,  and  if  so  how  quickly. 

II.  A  quantitative  convergence  result. 

We  describe  here  a  general  method  (Rosenthal, 
1993,  Theorem  12)  for  proving  quantitative  bounds 
on  the  time  to  stationarity  of  a  Markov  chain.  The 
method  requires  only  that  we  verify  a  drift  condition 
and  a  minorization  condition,  for  the  Markov  chain 
of  interest.  In  certain  simple  cases,  the  bound  ap¬ 
pears  to  be  small  enough  to  be  of  practical  use;  see 
Rosenthal  (1993,  1994)  and  references  therein.  For 
related  results  see  Meyn  and  Tweedie  (1993). 
Proposition. Let $P(x,\cdot)$ be the transition probabilities
for a Markov chain with stationary distribution
$\pi(\cdot)$. Suppose there exist $\epsilon > 0$, $0 < \lambda < 1$,
$0 < \Lambda < \infty$, $d > \frac{2\Lambda}{1-\lambda}$, $f : X \to R^{\ge 0}$, and a probability
measure $Q(\cdot)$ on $X$, such that
$E(f(X_1) \mid X_0 = x) \le \lambda f(x) + \Lambda$ for $x \in X$, and
$P(x,\cdot) \ge \epsilon Q(\cdot)$ for $x \in f_d$, where $f_d = \{x \in X \mid f(x) \le d\}$.
Then for any $0 < r < 1$, the total variation distance to the
stationary distribution after $k$ iterations is bounded
above by

$$(1-\epsilon)^{rk} + \left(\alpha^{-(1-r)}\gamma^{r}\right)^{k}\left(1 + \frac{\Lambda}{1-\lambda} + E(f(X_0))\right),$$

where $\alpha^{-1} = \frac{1 + 2\Lambda + \lambda d}{1 + d}$, $\gamma = 1 + 2(\lambda d + \Lambda)$.
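To see how such a bound behaves, the sketch below evaluates it numerically. It assumes the bound is read as (1 - eps)^(rk) + (alpha^-(1-r) * gamma^r)^k * (1 + Lam/(1 - lam) + E f(X_0)), with alpha^-1 = (1 + 2 Lam + lam d)/(1 + d) and gamma = 1 + 2(lam d + Lam), and with the requirement d > 2 Lam/(1 - lam), which is our reading of Rosenthal's theorem. The function name and the parameter values are hypothetical.

```python
def rosenthal_bound(k, eps, lam, Lam, d, r, Ef0=0.0):
    """Evaluate the total variation bound after k iterations.

    eps: minorization constant; lam, Lam: drift constants;
    d: level of the small set; r: free parameter in (0, 1);
    Ef0: E f(X_0) under the initial distribution.
    """
    if not (0.0 < eps and 0.0 < lam < 1.0 and 0.0 < Lam and 0.0 < r < 1.0):
        raise ValueError("parameters outside the ranges assumed in the proposition")
    if d <= 2.0 * Lam / (1.0 - lam):
        raise ValueError("need d > 2 * Lam / (1 - lam)")
    alpha_inv = (1.0 + 2.0 * Lam + lam * d) / (1.0 + d)
    gamma = 1.0 + 2.0 * (lam * d + Lam)
    rate = alpha_inv ** (1.0 - r) * gamma ** r
    return (1.0 - eps) ** (r * k) + rate ** k * (1.0 + Lam / (1.0 - lam) + Ef0)


# Illustrative (made-up) constants; r must be small enough that rate < 1.
params = dict(eps=0.1, lam=0.5, Lam=1.0, d=5.0, r=0.02)
bound_at_0 = rosenthal_bound(0, **params)
bound_at_1000 = rosenthal_bound(1000, **params)
bound_at_2000 = rosenthal_bound(2000, **params)
```

Because alpha^-1 < 1 while gamma > 1, the second term decays only when r is chosen small enough, and a small r in turn slows the first term; in practice one would scan r in (0, 1) for the value giving the smallest bound at the desired k.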
I. Introduction

The use of model-based methods for data compression for
English dates back at least to Shannon’s Markov chain (n-gram)
models, where the probability of the next word given all previous
words equals the probability of the next word given the previous
n-1 words. A second approach seeks to model the hierarchical
nature of language via tree graph structures arising from a
context-free language (CFL). Neither the n-gram nor the CFL
models approach the data compression predicted by the entropy
of English as estimated by Shannon and Cover and King. This
paper presents two recently proposed models that incorporate the
benefits of both the n-gram model and the tree-based models
[1,2]. In either case the neighborhood structure on the syntactic
variables is determined by the tree while the neighborhood structure
of the words is determined by the n-gram and the parent
syntactic variable (preterminal) in the tree. Having both types of
neighbors for the words should yield decreased entropy of the
model and hence fewer bits per word in data compression. To
motivate estimation of model parameters, some results on estimating
parameters for random branching processes are reviewed.

II.  Random  Branching  Processes 

A stochastic context-free grammar (SCFG) is a quintuple
$\langle V_N, V_T, R, \sigma_0, P\rangle$, where $V_N$ is the set of $V$ syntactic variables
$\sigma_k$, $V_T$ is the finite set of words or terminals, $R$ is the finite set of
rules, $\sigma_0$ is the starting syntactic variable, and $P$ is the set of
conditional probabilities for the rules, conditioned on the syntactic
variable being rewritten. The probability of a derivation from
the SCFG is the product of all of the probabilities used in the
derivation. A tree $T$ is associated with a derivation by mapping
syntactic variables used in the derivation to nodes in the tree; the
rule used for rewriting each syntactic variable determines the
children nodes. Define the mean matrix $M$ to have its $j,k$ entry
equal to the expected number of $\sigma_k$ that result from rewriting $\sigma_j$.
$M$ has largest eigenvalue $\rho$ greater than or less than one according
to whether the SCFG is supercritical or subcritical.

A function of a tree, $F$, is said to be additive on the rules with
atomic function $f$ if $F(T) = \sum f(r)$, where the sum is over the
rules $r$ used. Let $n(T)$ equal the number of syntactic variables in
the tree $T$. Assume that $f$ is finite for all $r \in R$. Let $T_K$ be the
truncation of $T$ at derivation depth $K$.

Theorem  1  [3]:  Suppose  that  M  is  strongly  connected  with 
largest  eigenvalue  p>\  and  associated  left  eigenvector  v.  Then 
for  almost  all  infinite  length  derivations, 

fim  =  ,  (1) 

ir^oo  n(Tji:)  ,=i 

where $\gamma_i$ is the number of rules in $R(i)$, the set of rules for
rewriting $\sigma_i$; $p(i,k)$ is the probability of the $k$th rule in $R(i)$; and
$f$ is the $V \times 1$ vector with $i$th entry $\sum_{k=1}^{\gamma_i} p(i,k) f(i,k)$.

Extensions of this theorem include the convergence of ratios of
such functions [3]. Notice that $F(T_K)$ and $n(T_K)$ are derivation
statistics that can be used to estimate model parameters via (1).
The SCFGs used to model English are usually subcritical. The
corresponding result requires a sequence of independent derivations
from the SCFG.

Theorem 2: Suppose that $M$ has largest eigenvalue less than
one. Let $\{T^{(m)}\}$ be a sequence of independent trees each having
distribution determined by the SCFG. Then

$$\lim_{M\to\infty} \frac{1}{M}\sum_{m=1}^{M} F(T^{(m)}) = Z_0 (I - M)^{-1} f, \qquad (2)$$

where on the left $M$ counts the trees and on the right it is the mean
matrix, and $Z_0$ is the $1 \times V$ unit vector with entry one in the location
corresponding to the syntactic variable $\sigma_0$.
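The right-hand side of (2), read as Z_0 (I - M)^{-1} f, can be evaluated for a toy grammar as a sanity check. The sketch below is an illustration only: the one-variable grammar, the atomic function and the helper name are all assumptions, and the inverse is computed through the Neumann series, which converges because the grammar is subcritical.

```python
def expected_additive_total(M, f, start=0, tol=1e-12, max_iter=10000):
    """Return the `start` entry of (I - M)^{-1} f via the series sum_k M^k f.

    M[j][k] is the expected number of variables k produced by rewriting j;
    f[i] is the expected atomic-function value of one rewrite of variable i.
    """
    size = len(f)
    term = list(f)    # M^0 f
    total = list(f)
    for _ in range(max_iter):
        term = [sum(M[i][j] * term[j] for j in range(size)) for i in range(size)]
        total = [t + s for t, s in zip(total, term)]
        if max(abs(s) for s in term) < tol:
            break
    return total[start]


# Hypothetical grammar: S -> S S with probability 0.3, S -> a with probability 0.7.
# Each rewrite of S yields on average 2 * 0.3 = 0.6 new S's, so M = [[0.6]].
M = [[0.6]]
f = [1.0]   # atomic function = 1 per rule, so F(T) counts rule applications
expected_rules = expected_additive_total(M, f)   # 1 / (1 - 0.6) = 2.5
```

With the atomic function equal to one for every rule, F(T) is the number of rule applications, so the limit in (2) is the expected size of a derivation; averaging observed statistics over a treebank and matching them to this expression is one way the theorem could be used for parameter estimation.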

III.  Proposed  Language  Models 

The  first  proposed  language  model  adds  n-gram  constraints  to 
the  tree-based  models.  For  a  given  word  string  (sentence) 
W|  =  w,  W2  •  •  •  w;v,  define  the  relative  frequency  of  WjCOi  by 

_  _1_  Vi/  1 


Theorem 3 [2]: The probability distribution $p$ on trees $T$ minimizing
the Kullback-Leibler distance from the distribution $\pi$
defined by the SCFG, $\sum_T p(T)\log\frac{p(T)}{\pi(T)}$, subject to the bigram

constraints  1)]  =  coj,Q)i  e  Vj,  is 


X  X  «0ja>ino>ja,i(Wi,NT)MT)  , 

o)j€Vf  cojeV]- 


(4) 


where  Z  is  the  normalizing  constant  and  the  are  the 
Lagrange  multipliers  chosen  to  satisfy  the  constraints. 

The distribution (4) induces the neighborhood structure discussed
in the introduction. As with many random field models,
computing $Z$ is problematic. This motivates a second model, the
mixed tree/chain graph. In this model, let $T$ be the tree down to
the preterminal layer, and label the preterminals for a particular
derivation by $\gamma_k$, $k = 1, 2, \ldots, N_T$. The probabilities of words
are determined by conditional probabilities on the words,
$p(w_k \mid w_{k-1}, \gamma_k)$, and the SCFG down to the preterminal level.

Issues  that  are  under  investigation  include:  the  decrease  in 
entropy  obtained  by  using  successively  more  complicated  mod¬ 
els;  comparative  performance  of  different  models  as  a  function 
of  the  number  of  parameters;  estimation  of  parameters  in  the  two 
proposed  models  using  the  Penn  TreeBank;  determining  the 
compressibility  of  the  Penn  TreeBank  using  our  models. 
Abstract - A load balancing problem is formulated
for infinite networks or graphs. There are
overlapping  sets  of  locations,  each  set  having  an 
associated  possibly  random  amount  of  load  to  be 
distributed.  The  total  load  at  a  location  is  the 
sum  of  the  contributions  due  to  the  sets  that 
contain  it.  Equilibrium  is  said  to  hold  if  the 
load  corresponding  to  any  one  set  cannot  be  re¬ 
assigned  to  improve  the  balance  of  total  loads. 
The  set  of  possible  equilibria,  or  balanced  load 
vectors,  is  examined.  The  balanced  load  vector 
is  shown  to  be  unique  for  Euclidean  lattice  net¬ 
works,  in  which  the  sets  correspond  to  pairs  of 
neighboring  nodes  in  a  rectangular  lattice  in  fi¬ 
nite  dimensions.  A  method  for  computing  the 
load  distribution  is  explored  for  tree  networks. 
An  FKG  type  inequality  is  proved.  The  concept 
of  load  percolation  is  introduced  and  is  shown  to 
be  associated  with  infinite  sets  of  locations  with 
identical  load. 

SUMMARY 

A balancing problem is specified by a collection
$(U, V, N, m)$, where $U$ and $V$ are finite or countably infinite
sets, $N = \{N(u) : u \in U\}$ where $N(u)$ is a finite
subset of $V$ for each $u \in U$, and $m = (m_u : u \in U)$
where $m_u \ge 0$ for all $u$. For example, $U$ may denote
the edges of a (possibly infinite) graph, $V$ the vertices,
and $N(u)$ the set consisting of the two endpoints of
edge $u$ for each $u$. An assignment vector is a vector
$f = (f_{u,v} : u \in U, v \in V)$ with nonnegative entries. It
is said to meet the demand $m$ if

$$\sum_{v \in N(u)} f_{u,v} = m_u \quad \text{for } u \in U. \qquad (1)$$

The total load at $v$, $x(v)$, is given by

$$x(v) = \sum_{u \in U} f_{u,v}, \quad v \in V. \qquad (2)$$

A vector $x = (x(v) : v \in V)$ so arising from an assignment
vector $f$ meeting the demand is called a load
vector. A load vector $x$ is said to be balanced if, for
some corresponding $f$, the following condition holds:
for all $u \in U$ and all $v, v' \in N(u)$, $f_{u,v} = 0$ whenever
$x(v) > x(v')$.

*This work was supported by JSEP Contract N00014-90-J-1270
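For a finite instance, the definitions of meeting the demand, total load and balance can be checked directly. The sketch below is a minimal illustration on a three-vertex path graph; the data layout (assignments stored as a dictionary keyed by (u, v) with v in N(u)) and all names are assumptions of the example.

```python
from fractions import Fraction


def total_load(f, vertices):
    """x(v) = sum over u of f[u, v]."""
    x = {v: Fraction(0) for v in vertices}
    for (u, v), amount in f.items():
        x[v] += amount
    return x


def meets_demand(f, N, m):
    """Check that the assignments of each u over N(u) sum to m[u]."""
    return all(sum(f.get((u, v), 0) for v in N[u]) == m[u] for u in N)


def is_balanced(f, N, m, vertices):
    """Check that f meets the demand and that f[u, v] = 0 whenever
    x(v) > x(v') for some v' in N(u)."""
    if not meets_demand(f, N, m):
        return False
    x = total_load(f, vertices)
    for u in N:
        for v in N[u]:
            for w in N[u]:
                if x[v] > x[w] and f.get((u, v), 0) != 0:
                    return False
    return True


# Path graph a - b - c: U = edges, V = vertices, unit demand on each edge.
vertices = ["a", "b", "c"]
N = {"e1": ["a", "b"], "e2": ["b", "c"]}
m = {"e1": Fraction(1), "e2": Fraction(1)}

even = {("e1", "a"): Fraction(2, 3), ("e1", "b"): Fraction(1, 3),
        ("e2", "b"): Fraction(1, 3), ("e2", "c"): Fraction(2, 3)}
lopsided = {("e1", "a"): Fraction(1), ("e2", "c"): Fraction(1)}

even_is_balanced = is_balanced(even, N, m, vertices)
lopsided_is_balanced = is_balanced(lopsided, N, m, vertices)
```

In the lopsided assignment, edge e1 sends all of its load to vertex a although its other endpoint b carries less, so some load could be reassigned to improve the balance; the even assignment gives every vertex load 2/3 and no such reassignment exists.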

The main questions addressed in this paper can be
stated in broad terms as follows. How can the set of
balanced load vectors be characterized? It is not difficult
to show that balanced load vectors exist, but are they
unique? What is the distribution of the load at a given
location for a balanced load vector when the demand
vector is random? Finally, what “global” or long-range
effects can be observed in balanced load vectors?

The highlights of this paper are summarized as follows.
The concept of load balancing on an infinite network
is introduced (in somewhat more generality than
the above). Minimal and maximal balanced load vectors
are shown to exist, and the idea of load balancing in finite
subsets with boundary conditions is used to exhibit
a one-parameter family of balanced load vectors whenever
the balanced load vector is not unique. It is shown
that the balanced load vector is unique for a wide class
of networks including rectangular lattice networks. The
concept of r-surplus is used to characterize the possible
distributions of the load at a location in a tree network
with independent, identically distributed demands. The
cases of Bernoulli and exponentially distributed
demands are investigated in some detail. Finally a notion
of long range interaction, load percolation, is introduced.
Load percolation is shown to imply the existence
of infinite connected sets of locations with identical load.
Abstract — We describe extensions to the “best-basis”
method to select orthonormal bases suitable
for  signal  classification  (or  regression)  problems  from 
a  collection  of  orthonormal  bases  using  the  relative 
entropy  (or  regression  errors).  Once  these  bases  are 
selected,  the  most  significant  coordinates  are  fed  into 
a  traditional  classifier  (or  regression  method)  such 
as  Linear  Discriminant  Analysis  (LDA)  or  a  Clas¬ 
sification  and  Regression  Tree  (CART).  The  perfor¬ 
mance  of  these  statistical  methods  is  enhanced  since 
the  proposed  methods  reduce  the  dimensionality  of 
the  problems  by  using  the  basis  functions  which  are 
well-localized  in  the  time-frequency  plane  as  feature 
extractors. 

I.  Summary 

The  best-basis  algorithm  of  Coifman  and  Wickerhauser  [3] 
was  developed  mainly  for  signal  compression.  This  method 
first  expands  a  given  signal  into  a  dictionary  of  orthonormal 
bases,  i.e.,  a  redundant  set  of  wavelet  packet  bases  or  local 
sine/cosine  bases  having  a  binary  tree  structure.  The  nodes 
of  the  tree  represent  subspaces  with  different  time-frequency 
localization  characteristics.  Then  a  complete  basis  called  a 
best  basis  which  minimizes  a  certain  information  cost  func¬ 
tion  (e.g.,  entropy)  is  searched  in  this  binary  tree  using  the 
divide-and-conquer  algorithm.  This  cost  function  measures 
the  flatness  of  the  energy  distribution  of  the  signal  so  that 
minimizing  this  leads  to  an  efficient  representation  (or  coordi¬ 
nate  system)  for  the  signal.  Because  of  this  cost  function,  the 
best-basis  algorithm  is  good  for  signal  compression  but  is  not 
necessarily  good  for  classification  or  regression  problems. 

For classification, we need a measure to evaluate the discrimination
power of the nodes (or subspaces) in the tree-structured bases. There
are many choices for the discriminant measure $\mathcal{D}$ (see e.g.,
[1]). For simplicity, let us first consider the two-class case. Let
$p = \{p_i\}_{i=1}^n$, $q = \{q_i\}_{i=1}^n$ be two nonnegative
sequences with $\sum_i p_i = \sum_i q_i = 1$ (which can be viewed as
normalized energy distributions of signals belonging to class 1 and
class 2 respectively in a coordinate system). One natural choice for
$\mathcal{D}$ is relative entropy:
$D(p,q) = \sum_{i=1}^n p_i \log(p_i/q_i)$. If a symmetric quantity is
preferred, one can use the J-divergence between $p$ and $q$:
$J(p,q) = D(p,q) + D(q,p)$. The measures $D$ and $J$ are both
additive: for any $j$, $1 \le j < n$,
$\mathcal{D}(p,q) = \mathcal{D}(\{p_i\}_{i=1}^j, \{q_i\}_{i=1}^j) +
\mathcal{D}(\{p_i\}_{i=j+1}^n, \{q_i\}_{i=j+1}^n)$. For measuring
discrepancies among $L$ distributions, one may take $\binom{L}{2}$
pairwise combinations of $\mathcal{D}$.
The  following  algorithm  selects  an  orthonormal  basis  (from 
the  dictionary)  which  maximizes  the  discriminant  measure  on 
the  time-frequency  energy  distributions  of  classes.  We  call 
this  a  local  discriminant  basis  (LDB). 
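
The two discriminant measures named above are easy to check numerically. The following is a minimal sketch, assuming strictly positive entries wherever a ratio is taken; the function names and the rule of adding the measure over all pairs of classes are illustrative assumptions, not taken from the paper.

```python
import math
from itertools import combinations


def relative_entropy(p, q):
    """D(p, q) = sum_i p_i log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def j_divergence(p, q):
    """Symmetrized measure J(p, q) = D(p, q) + D(q, p)."""
    return relative_entropy(p, q) + relative_entropy(q, p)


def pairwise_total(distributions, measure=j_divergence):
    """Assumed combination rule: add the measure over all C(L, 2) pairs."""
    return sum(measure(a, b) for a, b in combinations(distributions, 2))


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
j = 1  # split point used for the additivity check
whole = relative_entropy(p, q)
parts = relative_entropy(p[:j], q[:j]) + relative_entropy(p[j:], q[j:])
print(whole, parts)  # the two values agree up to rounding
```

Additivity is what lets the measure of a parent node be compared with the sum over its children without recomputing anything on the union of coordinates.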

Algorithm  1  Given  L  classes  of  training  signals, 

Step  0:  Choose  a  dictionary  of  orthonormal  bases  (i.e.,  specify 
QMFs  for  a  wavelet  packet  dictionary  or  decide  to  use  either 
the  local  cosine  dictionary  or  the  local  sine  dictionary). 


Step 1: Construct a time-frequency energy map for each class
by: normalizing each signal by the total energy of all signals of
that class, expanding that signal into the tree-structured
subspaces, and accumulating the signal energy in each coordinate.

Step 2: At each node, compute the discriminant measure $\mathcal{D}$
among $L$ time-frequency energy maps.

Step 3: Prune the binary tree: eliminate children nodes if the
sum of their discriminant measures is smaller than or equal
to the discriminant measure of their parent node.

Step 4: Order the basis functions by their discrimination
power and use the $k$ ($\ll n$) most discriminant basis vectors for
constructing classifiers.
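
The pruning rule of Step 3 can be sketched as a bottom-up pass over the tree. This is only an illustration under stated assumptions: the tree is complete and stored in heap order (node i has children 2i+1 and 2i+2), the per-node discriminant measures are already computed, and, as in the usual best-basis search, the value carried upward from a child is the best value found in its subtree. The name `prune_tree` is hypothetical.

```python
def prune_tree(scores):
    """Bottom-up pruning of a complete binary tree stored in heap order.

    scores[i] is the discriminant measure of node i (children: 2i+1, 2i+2).
    Returns (best_total, selected), where `selected` lists the retained
    nodes from left to right; together they form one complete basis.
    """
    n = len(scores)
    best = list(scores)            # best value achievable in each subtree
    keep_children = [False] * n
    for i in range(n - 1, -1, -1):  # children have larger indices than parents
        left, right = 2 * i + 1, 2 * i + 2
        if right < n:
            child_sum = best[left] + best[right]
            # Children are eliminated when their sum is <= the parent's measure.
            if child_sum > scores[i]:
                best[i] = child_sum
                keep_children[i] = True
    selected, stack = [], [0]
    while stack:
        i = stack.pop()
        if keep_children[i]:
            stack.extend((2 * i + 2, 2 * i + 1))  # visit the left child first
        else:
            selected.append(i)
    return best[0], selected


# Seven-node example: the left subtree is split, the right one is kept whole.
total, nodes = prune_tree([1.0, 0.4, 0.9, 0.3, 0.3, 0.2, 0.2])
print(total, nodes)
```

Because the measure is additive, each node is visited once, so the search costs time proportional to the number of nodes.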

For regression problems, we use the same algorithm with Steps 2
and 3 modified. In Step 2, we compute the prediction (or
regression) error at each node instead of the discriminant measure
on the time-frequency energy distributions. In Step 3, we prune the
binary tree by comparing the prediction errors of each parent
node and the union of its two children nodes: eliminate
the children nodes if their prediction error is larger than that of
their parent node. We call the basis so obtained a local regression
basis (LRB). One disadvantage is that the prediction error is
not an additive measure, so the algorithm is slower than
the LDB algorithm.

We  tested  our  method  using  the  triangular  waveform  classi¬ 
fication  (three-class  problem)  described  in  [2].  We  first  gener¬ 
ated  100  training  signals  and  1000  test  signals  for  each  class. 
Then,  we  supplied  the  raw  signals  to  LDA  and  CART  and 
obtained  the  misclassification  rates  20.90%,  29.87%,  respec¬ 
tively,  using  the  test  signals.  Finally,  we  computed  the  LDB 
from the wavelet packet dictionary with the 6-tap coiflet filter,
and supplied the five most discriminant coordinates to LDA
and CART. The misclassification rates became 15.90% and
21.37%, respectively. Note that the Bayes error of this example is
about 14% [2]. The details as well as other examples and
applications of LDB/LRB can be found in [4], [5], and [6].

Abstract — A brief discussion is given of the role of
approximation and smoothness spaces in algorithms
for noise removal and compression.

I.  Introduction 

Compression  and  noise  removal  can  be  viewed  as  problems  of 
approximation.  Because  of  space  limitations,  we  limit  our  dis¬ 
cussion  to  cases  where  approximation  takes  place  in  a  Hilbert 
space  H  although  the  theory  applies  in  far  greater  generality. 
Let $\{\phi_\lambda\}_{\lambda\in\Lambda}$ be a complete orthonormal system for $H$.

II.  Linear  and  Nonlinear  approximation 

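In linear approximation, we approximate by the elements of
the linear spaces $X_n := \mathrm{span}\{\phi_\lambda\}_{\lambda\in\Lambda_n}$,
$\Lambda_n \subset \Lambda_{n+1} \subset \Lambda$, $n = 1, 2, \ldots$.
If $f = \sum_{\lambda\in\Lambda} c_\lambda\phi_\lambda$, then
$\sum_{\lambda\in\Lambda_n} c_\lambda\phi_\lambda$ is the best
approximation from $X_n$ and
$E_n(f) := \bigl(\sum_{\lambda\notin\Lambda_n} |c_\lambda|^2\bigr)^{1/2}$
is the approximation error.

In nonlinear approximation, we fix a number $n > 0$ and
approximate $f$ by $\sum_{\lambda\in\Lambda_0} c_\lambda\phi_\lambda$,
where $\Lambda_0$ is an arbitrary subset of $\Lambda$ with $2^n$
elements. The best nonlinear approximation is obtained by taking
$\Lambda_0$ as the set of the $2^n$ indices $\lambda$ for which
$|c_\lambda|$ is largest. We denote the nonlinear approximation error
by $\sigma_n(f)$.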
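
The difference between the two schemes can be seen on a short coefficient vector. The sketch below assumes the coefficients are taken with respect to an orthonormal system, so that each error is the root of the discarded squared coefficients; `keep` stands in for the number of retained terms ($2^n$ in the text), and the function names are illustrative.

```python
import math


def linear_error(coeffs, keep):
    """Error when a fixed index set (here: the first `keep` indices) is kept."""
    return math.sqrt(sum(abs(c) ** 2 for c in coeffs[keep:]))


def nonlinear_error(coeffs, keep):
    """Error when the `keep` largest coefficients in modulus are kept."""
    mags = sorted((abs(c) for c in coeffs), reverse=True)
    return math.sqrt(sum(m ** 2 for m in mags[keep:]))


coeffs = [0.0, 3.0, 0.0, 4.0]
print(linear_error(coeffs, 2), nonlinear_error(coeffs, 2))
```

For any coefficient vector the nonlinear error is at most the linear one with the same number of terms, since the adaptive choice discards the smallest coefficients.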

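III. Approximation spaces

What elements $f \in H$ can be approximated well by these
methods? For example, what elements have an approximation
error like $O(2^{-n\alpha})$? For $\alpha > 0$, $0 < q \le \infty$, let
$A^\alpha_q(L)$ denote the set of $f \in H$ such that
$\sum_{n\ge 1} [2^{n\alpha} E_n(f)]^q$ is finite, with
the usual change to a sup when $q = \infty$. We replace $E_n(f)$
by $\sigma_n(f)$ to get the space $A^\alpha_q(N)$. Then, $f$ is in
$A^\alpha_q(L)$ if and only if
$\sum_{n\ge 1} \bigl[2^{n\alpha} \bigl(\sum_{\lambda\in\Lambda_{n+1}\setminus\Lambda_n} |c_\lambda|^2\bigr)^{1/2}\bigr]^q$
is finite. We can characterize $A^\alpha_q(N)$ only for special $q$,
namely, $q = (\alpha + 1/2)^{-1}$, in which case $f$ is in this space
if and only if $\sum_{\lambda\in\Lambda} |c_\lambda|^q$ is finite.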

IV.  Examples 

Let $H = L_2(T^d)$ be the space of $2\pi$-periodic functions on the
torus and $\phi_k := e_k$, $k \in Z^d$, with $e_k(x) := e^{ik\cdot x}$,
the complex exponentials. We take $\Lambda_n := \{k : |k| \le 2^n\}$.
Then, the linear approximation problem corresponds to approximation
by the partial sums of the Fourier series of $f$, and
$f \in A^\alpha_q(L)$ if and only if $f$ is in the Besov space
$B^\alpha_q(L_2(T^d))$ (when $q = 2$ and $\alpha = r$ is an integer,
this is equivalent to $f$ in the Sobolev space $W^r(L_2(T^d))$). For
nonlinear approximation by complex exponentials,
$f \in A^\alpha_q(N)$, $\alpha > 0$, $q = (\alpha + 1/2)^{-1}$,
if and only if $\sum_{k\in Z^d} |\hat f(k)|^q$ converges; e.g., if
$\alpha = 1/2$, the Fourier series of $f$ should converge absolutely
(Stechkin's criterion).

Another important example is when
$\Lambda_n := \{k \in Z^d : |k_1 \cdots k_d| < n\}$ is the hyperbolic
cross. In this case, $A^\alpha_q(L)$ is a Besov-like space with the
usual modulus of smoothness replaced by a mixed modulus [1].

¹This work was supported by ONR Contract N0014-91-J1343.


V.  Wavelets  Examples 

Let $H = L_2(R)$ and $\phi \in H$ be a univariate scaling function
with orthonormal shifts $\phi(\cdot - j)$, $j \in Z$, which generates
the orthogonal wavelet $\psi$. The functions
$\psi_{j,k} := 2^{k/2}\psi(2^k \cdot - j)$, $j,k \in Z$, are a complete
orthonormal system for $H$. For
$\Lambda_n := \{(j,k) : j \in Z, k < n\}$, the $A^\alpha_q(L)$ are
again Besov spaces $B^\alpha_q(L_2(R))$ for a range of $\alpha$
depending on $\psi$. The nonlinear approximation spaces
$A^\alpha_q(N)$ are the Besov spaces $B^\alpha_q(L_q)$
provided $q = (\alpha + 1/2)^{-1}$ [2].

There are various multivariate orthonormal bases for
$L_2(R^d)$ which can be constructed from $\phi$ and $\psi$. For
example, if $d = 2$, the usual orthogonal basis used in wavelet
applications consists of the functions
$\eta_{j_1,k}(x)\,\mu_{j_2,k}(y)$, $j_1, j_2, k \in Z$,
with $\eta, \mu$ either $\phi$ or $\psi$ but not both $\phi$. The
approximation classes for linear approximation by partial sums of
wavelet series with respect to this basis are Besov spaces
$A^\alpha_q(L) = B^\alpha_q(L_2(R^2))$. For nonlinear approximation,
$A^\alpha_q(N) = B^\alpha_q(L_q(R^2))$, $q = (\alpha/2 + 1/2)^{-1}$.

Another wavelet basis, useful in some applications, is given
by the tensor products $\psi_{j_1,k_1}(x)\,\psi_{j_2,k_2}(y)$,
$j_1, j_2, k_1, k_2 \in Z$.
Linear approximation here is analogous to hyperbolic cross
Fourier approximation [3].

VI.  K-FUNCTIONALS 

If $Y \subset X$ are two Banach spaces and $f \in X$, then
$K(f,t,X,Y) := \inf_{f=b+g} \|b\|_X + t\|g\|_Y$, $t > 0$, is called the
K-functional of $f$. The K-functionals for many classical pairs
of spaces are characterized. K-functionals can be used to design
(optimal) compression and noise removal algorithms [2].
For example, if $X = L_2(\Omega)$ and $Y = W^r(L_2(\Omega))$ with
$\Omega \subset R^d$ a cube or all of $R^d$, then the best choice $g$
for fixed $t$ is given by linear approximation (for example using the
first wavelet basis in Sect. V). If $Y = W^r(L_2(\Omega))$ is the space
of functions with mixed $r$-th derivative, a best $g$ is given by
linear hyperbolic (or tensor product) approximation. If
$Y = B^\alpha_q(L_q(\Omega))$, $q = (\alpha/d + 1/2)^{-1}$, then $g$ is
given by nonlinear wavelet approximation. These choices of $g$ give
linear or nonlinear compression algorithms optimal for corresponding
function classes [2].

In noise removal, one uses the K-functional for noisy $f$; each
$t$ gives a noise removal algorithm. Minimizing the expected error
with respect to $t$ leads to linear or nonlinear noise removal
algorithms such as wavelet shrinkage [2,4].

Abstract — Recently, adaptive signal representations
in overcomplete libraries of waveforms have been
very popular [1, 5]. One naturally expects that in
searching through a large number of signal representations
for noisy data, one is at risk of identifying
apparent structure in the data which turns out to be
spurious, noise-induced artifacts. We show how to use
penalties based on the logarithm of library complexity
to temper the search, preventing such spurious
structure, and giving near-ideal behavior.

I.  Adaptive  Signal  Representations 

Over the last five years or so, there has been an explosion of
awareness of alternatives to traditional signal representations.
Instead of just representing objects as superpositions of
sinusoids (the traditional Fourier representation) we now have
available alternate dictionaries - signal representation schemes
- of which the Fourier dictionary is only the most well-known.
Wavelet dictionaries, Gabor dictionaries, Multi-scale Gabor
Dictionaries, Wavelet Packets, Cosine Packets, Chirplets, and
a wide range of other representations are now available. Each
such dictionary $\mathcal{D}$ is a collection of waveforms
$(\phi_\gamma)_{\gamma\in\Gamma}$, and we envision a decomposition of
a signal $s$ as

$$ s = \sum_{\gamma\in\Gamma} \alpha_\gamma \phi_\gamma. \qquad (1) $$

Depending on the dictionary, such a decomposition is a
decomposition into pure tones (Fourier dictionary), bumps (wavelet
dictionary), chirps (chirplet dictionary), etc.

A key point. The dictionaries we are interested in are all
overcomplete. The decomposition (1) is then nonunique, because
some elements in the dictionary have representations
in terms of other elements. This gives us the possibility of
adaptation, i.e. of choosing among many representations one
which is most suited to our purposes.

II.  Best  Ortho  Basis 

Coifman and Meyer have invented some time-frequency
dictionaries, wavelet packets and cosine packets, which have a
very special structure. Certain structured subcollections of
the elements amount to orthogonal bases; one gets in this
way a wide range of orthonormal bases (in fact $\ge 2^n$ such
orthogonal bases for signals of length $n$). Coifman and
Wickerhauser [1] have proposed a method of adaptively picking
from among these many bases a single orthogonal basis which
is the best one. If $(s[B]_I)$ denotes the vector of coefficients
of $s$ in orthogonal basis $B$, and if we define the "entropy"
$\mathcal{E}(s[B]) = \sum_I e(s[B]_I)$, where $e(s)$ is a scalar
function of a scalar argument, they give a fast algorithm for solving

$$ \min\{\mathcal{E}(s[B]) : B \text{ ortho basis} \subset \mathcal{D}\}. $$

The algorithm is fast - it delivers a basis in order $n\log(n)$
time - and in some cases delivers near-optimal sparsity
representations.

¹This work was supported by NSF-DMS-92-09130, and by the
NASA Astrophysics Data Program.

III.  Choice  of  Entropy  for  De-Noising 

Suppose we have observations $y_i = s_i + z_i$, $i = 1, \ldots, n$,
where $(s_i)$ is signal and $(z_i)$ is i.i.d. Gaussian white noise.
Suppose we have available a library $\mathcal{L}$ of orthogonal bases,
such as the Wavelet Packet bases or the Cosine Packet bases
of Coifman and Meyer. We wish to select, adaptively based
on the noisy data $(y_i)$, a basis in which best to recover the
signal ("de-noising"). Let $M_n$ be the total number of distinct
vectors occurring among all bases in the library and let
$t_n = \sqrt{2\log(M_n)}$. (For wavelet packets, $M_n = n\log_2(n)$.)

Let $y[B]$ denote the original data $y$ transformed into the
basis $B$. Choose $\lambda > 8$ and set
$\Lambda_n = (\lambda \cdot (1 + t_n))^2$. Define
the entropy functional

$$ \mathcal{E}_\lambda(y, B) = \sum_i \min(y_i^2[B], \Lambda_n). $$

Let $\hat B$ be the best orthogonal basis according to this entropy:
$\hat B = \arg\min_{B\in\mathcal{L}} \mathcal{E}_\lambda(y, B)$.
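
As a rough illustration of this selection rule, the sketch below scores each candidate basis by the capped sum of squared coefficients and keeps the minimizer. It assumes the coefficients of the data in every basis of the library are already available as plain lists, and that the cap equals $\Lambda_n$ (the square of the threshold used for hard-thresholding in the text); all function and variable names are hypothetical.

```python
import math


def cap_level(num_vectors, lam):
    """Lambda_n = (lam * (1 + t_n))**2 with t_n = sqrt(2 log M_n)."""
    t_n = math.sqrt(2.0 * math.log(num_vectors))
    return (lam * (1.0 + t_n)) ** 2


def entropy(coeffs, cap):
    """Capped energy: sum_i min(y_i^2, cap)."""
    return sum(min(c * c, cap) for c in coeffs)


def best_basis(coeffs_by_basis, cap):
    """Name of the basis whose coefficients minimize the capped energy."""
    return min(coeffs_by_basis, key=lambda name: entropy(coeffs_by_basis[name], cap))


def hard_threshold(coeffs, t):
    """Keep a coefficient only if its modulus exceeds t."""
    return [c if abs(c) > t else 0.0 for c in coeffs]


# Two toy "bases" holding the same energy: one concentrates it, one spreads it.
library = {"concentrated": [10.0, 0.0, 0.0, 0.0], "spread": [5.0, 5.0, 5.0, 5.0]}
choice = best_basis(library, cap=16.0)
print(choice, hard_threshold(library[choice], 4.0))
```

Since every orthogonal basis carries the same total energy, the cap is what separates the candidates: a basis that packs the energy into a few large coefficients pays the cap only a few times.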

Define the hard-threshold nonlinearity
$\eta_t(y) = y\,1_{\{|y|>t\}}$. In
the empirical best basis, apply hard-thresholding with
threshold $t = \sqrt{\Lambda_n}$:

$$ \hat s^*_i[\hat B] = \eta_{\sqrt{\Lambda_n}}(y_i[\hat B]). $$

Theorem: With probability exceeding $\pi_n = 1 - e/M_n$,

$$ \|\hat s^* - s\|_2^2 \le (1 - 8/\lambda)^{-1} \cdot \Lambda_n \cdot \min_B E\|\hat s_B - s\|_2^2. $$

Here the minimum is over all ideal procedures working in all
bases of the library, i.e. in basis $B$, $\hat s_B$ is just
$y_i[B]\,1_{\{|s_i[B]|>1\}}$.
In short, the basis-adaptive estimator achieves a loss within
a logarithmic factor of the ideal risk which would be achievable
if one had available an oracle which would supply perfect
information about the ideal basis in which to de-noise, and
also about which coordinates were large or small.

Abstract — Long-range dependent processes exhibit
features, such as 1/f spectra, for which wavelets offer
versatile tools and provide a unifying framework. This
efficiency is demonstrated on continuous processes,
point processes and filtered point processes.

I.  Introduction 

Many signals, in many different domains (solid-state physics,
biology, turbulence, communications, ...), exhibit 1/f spectra
which reveal some long-range dependence (LRD). Although
considering LRD processes is therefore necessary, this remains
a challenging problem (from the point of view of both modeling
and analysis), thus calling for new approaches capable of
supplementing some of the specific tools developed so far [3].

II. Fractional Brownian motion

Fractional Brownian motion (fBm) is the first well-known example
of a continuous and LRD process for which wavelets
proved efficient [4, 5, 8, 9, 10]. The main reasons are as
follows: 1. although fBm is nonstationary, its wavelet transform
is stationary at any scale (this is due in fact to the
stationarity of its increments); 2. the Hurst exponent $H$ of a fBm
can be deduced from the variance law of details across scales;
3. whereas fBm is LRD, details of a dyadic decomposition
are almost uncorrelated. The effectiveness of using wavelets
for fBm analysis can be further evidenced by a comparison
with more classical techniques devoted to continuous LRD
processes. Given a fBm $B_H(t)$ for which $\mathrm{var}\,B_H(t)$
behaves as $|t|^{2H}$, it is known that the estimation of $H$
requires the use of a refined variance estimator, referred to as the
Allan variance. It turns out that such an approach amounts to using a
Haar wavelet decomposition, with limitations due to the low
regularity of the basis functions [4]. While retaining the same
principle, more regular wavelets therefore offer a way of
generalizing the Allan variance, with an increased performance.

III.  Fractal  point  process  -  Fractal  shot  noise 

Beyond fBm, wavelets are also efficient for tracking LRDs in
point processes. Let us consider $P(t) = \sum_k g(t - t_k)$,
where the $t_k$ are Poisson distributed, with an intensity $\lambda(t)$.
(The usual Poisson process simply corresponds to $g = \delta$ and
$d\lambda/dt = 0$.) A LRD process (referred to as fractal point
process (FPP) [6]) can be constructed within this model by choosing
$\lambda(t)$ to be fractional Gaussian noise (i.e., the "derivative" of
a fBm). Starting from the remark that, for the counting process
$N(T)$ associated to a Poisson process, we always have
$\mathrm{var}\,N(T) = E\,N(T)$, a departure from Poisson can be revealed
by means of the Fano factor $F(T) = \mathrm{var}\,N(T)/E\,N(T)$. In
the case of a FPP, $F(T) \sim 1 + T^{2H-1}$ when $T$ is large and
$H > 1/2$ [6]. In analogy with the definition of $F(T)$, a
wavelet-based Fano factor $WF(j)$ can then be defined [2], as a function
of scale $j$, by using both the variance of the details $d_P[j,n]$
and the average of the approximations $a_P[j,n]$. The result is

$$ WF(j) = 2^{j/2}\, E\,d_P^2[j,n] / E\,a_P[j,n] \sim 1 + (2^j)^{2H-1}, $$

when $j$ is large and the degree of cancellation $R$ is such that
$R > H - \frac{1}{2}$. When using the Haar wavelet in $WF$ and
the Allan variance in the estimation of $F$, we have exactly
$WF(j) = F(2^j)$. Wavelets therefore offer a way of generalizing
the concept of Fano factor and increasing its efficiency
when $R$ is larger than 0. Moreover, the proposed generalization
makes it possible to deal directly with filtered point processes,
which the Fano factor does not. If we consider for instance the model
of fractal shot noise (FSN) [7], for which $\lambda(t)$ is a constant but
$g(t) = t^{-\beta}$ if $0 < A < t < B < +\infty$ and 0 elsewhere, we
obtain that $WF(j)$ behaves as 2”^^ when $1/A > 2^{-j} > 1/B$
(with the only condition $R > \beta - \frac{1}{2}$) [2].
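
The classical Fano factor that the wavelet version generalizes can be estimated directly from event times. The sketch below is a plain empirical estimate under stated assumptions: counts are taken over disjoint windows of length T, and the population variance of those counts is divided by their mean. It does not implement the Allan-variance or wavelet-based estimators discussed above, and the function name is hypothetical.

```python
def fano_factor(event_times, window, total_time):
    """Empirical F(T) = var N(T) / E N(T) over disjoint windows of length T."""
    n_windows = int(total_time // window)
    if n_windows < 1:
        raise ValueError("total_time must cover at least one window")
    counts = [0] * n_windows
    for t in event_times:
        k = int(t // window)
        if 0 <= k < n_windows:
            counts[k] += 1
    mean = sum(counts) / n_windows
    if mean == 0:
        raise ValueError("no events fall inside the observation interval")
    var = sum((c - mean) ** 2 for c in counts) / n_windows
    return var / mean


regular = [i + 0.5 for i in range(100)]   # one event per unit of time
burst = [0.5] * 20                        # all events in the first window
print(fano_factor(regular, 10.0, 100.0), fano_factor(burst, 1.0, 10.0))
```

A perfectly regular train gives a factor below the Poisson value of 1, while clustered events push it above 1; tracking how the estimate changes with the window length T is what reveals long-range dependence.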

IV. Spectral analysis of 1/f processes

In any of the above cases, the basic ingredient in the analysis
is the variance of the details, which is time-invariant. This
leads to a unified perspective in the frequency domain since
such a variance reads

$$ E\,d_x^2[j,n] = 2^j \int_{-\infty}^{\infty} |\Psi(2^j f)|^2\, S_x(f)\, df, $$

where $S_x(f)$ is the (average) power spectrum of the analyzed
process and $\Psi(f)$ the Fourier transform of the analyzing
wavelet. A consequence of this relation is that wavelet analysis
is structurally matched to $1/f^\alpha$ spectra and provides an
efficient and unbiased estimation of $\alpha$, as detailed in [1].

I. Introduction

To compute the optimal expansion of signals in a redundant
dictionary of waveforms is an NP-complete problem. We
introduce a greedy algorithm, called matching pursuit, that
performs a sub-optimal expansion. This algorithm can be
interpreted as a shape-gain multistage vector quantization. The
waveforms are chosen iteratively in order to best match the
signal structures. Matching pursuits are general procedures
to compute adaptive signal representations. Applications to
speech and image processing with dictionaries of Gabor
functions will be shown, in particular for the removal of noises.

II.  Matching  Pursuit 

Let $H$ be a signal space. We define a dictionary as a redundant
family $\mathcal{D} = (g_\gamma)_{\gamma\in\Gamma}$ of vectors in $H$,
such that $\|g_\gamma\| = 1$.
We impose that linear expansions of vectors in $\mathcal{D}$ are dense
in $H$. An example of a dictionary is constructed by dilating,
translating and modulating a single window function $g(t)$ of
unit norm. For any scale $s > 0$, frequency modulation $\xi$ and
translation $u$, we denote $\gamma = (s, u, \xi)$ and define

$$ g_\gamma(t) = \frac{1}{\sqrt{s}}\, g\Bigl(\frac{t-u}{s}\Bigr) e^{i\xi t}. \qquad (1) $$

The index $\gamma$ is an element of the set $\Gamma = R^+ \times R^2$.
The factor $1/\sqrt{s}$ normalizes to 1 the norm of $g_\gamma(t)$. If
$g(t)$ is even, which is generally the case, $g_\gamma(t)$ is centered
at the abscissa $u$. Its energy is mostly concentrated in a
neighborhood of $u$, whose size is proportional to $s$.

A signal $f \in H$ does not have a unique representation as
a sum of elements of a redundant dictionary. A matching
pursuit decomposes $f$ over a set of vectors selected from
$\mathcal{D}$, by successive approximations. Let
$g_{\gamma_0} \in \mathcal{D}$; we decompose

$$ f = \langle f, g_{\gamma_0}\rangle\, g_{\gamma_0} + Rf, \qquad (2) $$

where $Rf$ is the residual vector after approximating $f$ in the
direction of $g_{\gamma_0}$. Clearly $g_{\gamma_0}$ is orthogonal to
$Rf$, hence

$$ \|f\|^2 = |\langle f, g_{\gamma_0}\rangle|^2 + \|Rf\|^2. \qquad (3) $$

To minimize $\|Rf\|$, we must choose $g_{\gamma_0} \in \mathcal{D}$ such
that $|\langle f, g_{\gamma_0}\rangle|$ is maximum.

Let us explain by induction how the matching pursuit is
carried further. Let $R^0 f = f$. We suppose that we have
computed the $n$-th order residue $R^n f$, for $n \ge 0$. We choose,
with the choice function $C$, an element $g_{\gamma_n} \in \mathcal{D}$
which closely matches the residue $R^n f$:

$$ |\langle R^n f, g_{\gamma_n}\rangle| = \sup_{\gamma\in\Gamma} |\langle R^n f, g_\gamma\rangle|. \qquad (4) $$

The residue $R^n f$ is sub-decomposed into

$$ R^n f = \langle R^n f, g_{\gamma_n}\rangle\, g_{\gamma_n} + R^{n+1} f, \qquad (5) $$

which defines the residue at the order $n+1$. Since $R^{n+1} f$ is
orthogonal to $g_{\gamma_n}$,

$$ \|R^n f\|^2 = |\langle R^n f, g_{\gamma_n}\rangle|^2 + \|R^{n+1} f\|^2. \qquad (6) $$

If we carry this decomposition up to the order $m$, we obtain

$$ f = \sum_{n=0}^{m-1} \langle R^n f, g_{\gamma_n}\rangle\, g_{\gamma_n} + R^m f \qquad (7) $$

and the energy conservation

$$ \|f\|^2 = \sum_{n=0}^{m-1} |\langle R^n f, g_{\gamma_n}\rangle|^2 + \|R^m f\|^2. \qquad (8) $$

One can also prove [1] that

$$ \lim_{m\to+\infty} \|R^m f\| = 0. \qquad (9) $$

This iterative procedure can be interpreted as a shape-gain
vector quantization in a very high dimensional space and is
also equivalent to a projection pursuit [2], used in statistics.

¹This work was supported in part by the AFOSR grant F49620-93-1-0102,
ONR grant N00014-91-J-1967 and the Alfred Sloan Foundation.
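
The recursion (4)-(8) can be followed on a tiny real-valued example. This is a minimal sketch, assuming a finite dictionary of unit-norm real vectors stored as lists and an exhaustive search for the largest inner product; it is not the fast algorithm of [1], and the function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(signal, dictionary, n_iter):
    """Greedy expansion over unit-norm atoms.

    Returns (picks, residual), where picks is a list of
    (atom_index, coefficient) pairs in order of selection.
    """
    residual = list(signal)
    picks = []
    for _ in range(n_iter):
        inner = [dot(residual, g) for g in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(inner[i]))
        c = inner[k]
        # Subtract the projection on the chosen atom to get the next residue.
        residual = [r - c * g for r, g in zip(residual, dictionary[k])]
        picks.append((k, c))
    return picks, residual


r = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [0.0, 1.0], [r, r]]   # redundant: three atoms in the plane
f = [3.0, 1.0]
picks, residual = matching_pursuit(f, atoms, 2)
energy = sum(c * c for _, c in picks) + dot(residual, residual)
print(picks, residual, energy, dot(f, f))
```

The last two printed numbers agree, which is the energy conservation (8) for this example.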

A matching pursuit can be calculated with a fast algorithm
[1] that is described in the talk. In the case of a time-frequency
dictionary of Gabor functions, the signal is decomposed as a
sum of time-frequency elements whose scale, position and
frequency match the time-frequency structures of the signal.
Applications to noise removal have been developed [1] and we
are currently using this representation for music analysis. For
image processing, we have constructed a dictionary of
two-dimensional Gabor waveforms with an orientation selectivity.
Decomposition of images and application to noise removal will
be demonstrated.

I. Introduction

In this paper we describe a probabilistic framework for optimal
multiresolution processing and analysis of spatial phenomena.
The multiresolution (MR) models we have developed are useful
in describing random processes and fields. The scale-recursive
nature of the resulting models leads to extremely efficient
algorithms for optimal estimation and likelihood calculation.
These models, described below, have also provided a framework
for data fusion, and produced new solutions to problems
in computer vision (optical flow estimation), remote sensing
(oceanography, where the dimensional complexity is in the
thousands), and various inverse problems of mathematical physics.

II.  Recursive  MR  Models 

The stochastic models that form the focus for our work
are defined on a tree $T$, where we use the index $t$ to denote
a general node of the tree. In our context the nodes of the
tree are organized into levels or resolutions, corresponding to
different resolutions of representation for the phenomenon of
interest. In particular, we can think of the nodes on the tree as
2-tuples, $(m(t), n(t))$, where $m(t)$ denotes the scale of the node
$t$ and $n(t)$ the spatial location corresponding to that node. In
describing images or 2-D signals, $m(t)$ and $n(t)$ may be vectors
themselves, describing scales and translational locations in the
two coordinate directions.

The models of interest in our work are scale-recursive
Markov models on $T$. Specifically, let 0 denote the root node
of the tree (i.e., the single node at the coarsest scale), let
$t\bar\gamma$ denote the parent of node $t$, and let
$t\alpha_1, \ldots, t\alpha_p$ denote the
descendants of $t$ (where in general the number of descendants
may vary from node to node). Then the model is given by

$$ x(t) = A(t)\,x(t\bar\gamma) + B(t)\,w(t), $$

where $w(t)$ is a zero-mean, unit variance, white process on
$T$ which is independent of $x(0)$, and $A(t)$ and $B(t)$ are matrices
that may (and frequently do) vary with $t$ or, at least, $m(t)$.
For example, the modeling of processes with particular scaling
laws, such as fractals, typically involves the use of noise gains
that decrease geometrically with scale.

Defined in this way, $x(t)$ is obviously a Markov random field
on $T$, and, moreover, given the value of $x(t)$, the values of
$x(\cdot)$ on the numerous disjoint subtrees extending from the node
$t$ are mutually independent. It is this fact that leads directly
to efficient algorithms for multiresolution signal and image
analysis. Specifically, consider the following set of multiscale
measurements:

$$ y(t) = C(t)\,x(t) + v(t), $$

where $v(t)$ is a zero-mean, unit variance white noise process
independent of $x(t)$. Note that this model allows for measurements
at multiple resolutions and also allows for nonstationary,
sparse and irregular measurements (in which case $C(t)$
certainly varies with $t$ and is zero except at selected nodes
at which measurements are available). This scale-recursive
model for $x(t)$ and the associated measurement model for
$y(t)$ admit an extremely efficient algorithm for estimating $x(t)$
throughout the tree given all of the available data. The
fine-to-coarse processing step computes the optimal estimate at
each node given the measurement at that node and at nodes
in its descendant subtree. The algorithm is highly parallelizable,
and thus well matched to hypercube architectures, yet it
remains extremely efficient even on a serial machine.

¹This work was supported in part by the Army Research Office
(DAAL-03-92-G-115), Air Force Office of Scientific Research
(F49620-92-J-2002) and National Science Foundation (MIP-9015281).

Note that the total number of nodes in the tree is a rather
small multiple of $N$ ($2N$ for dyadic trees used for 1-D signals
and $(4/3)N$ for quadtrees frequently used in image processing)
and the total computational complexity of the estimation
algorithm is $O(N)$, i.e., in image processing problems, it has
constant per-pixel computational complexity independent of
image size while producing estimates at a full set of
resolutions.
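
To make the coarse-to-fine recursion and the node count concrete, here is a small simulation sketch. It is a simplification under stated assumptions: the state is scalar, the gain A is a constant `a`, the noise gain decays geometrically as `b0 * 2**(-m/2)` with scale m, and each node has exactly two children. These choices and all names are illustrative and are not the models used by the authors.

```python
import random


def simulate_dyadic_tree(depth, a=1.0, b0=1.0, rng=None):
    """Draw one sample of a scalar scale-recursive process on a dyadic tree.

    Root: x ~ N(0, 1). Each child at scale m: x = a * x_parent + b(m) * w,
    with w ~ N(0, 1) and b(m) = b0 * 2**(-m / 2).
    Returns a list of levels, from the root (coarsest) to the leaves (finest).
    """
    rng = rng or random.Random(0)
    levels = [[rng.gauss(0.0, 1.0)]]
    for m in range(1, depth + 1):
        gain = b0 * 2.0 ** (-m / 2.0)
        levels.append([a * parent + gain * rng.gauss(0.0, 1.0)
                       for parent in levels[-1] for _ in range(2)])
    return levels


levels = simulate_dyadic_tree(4)
n_leaves = len(levels[-1])
n_nodes = sum(len(level) for level in levels)
print(n_leaves, n_nodes)
```

With N leaves the dyadic tree has 2N - 1 nodes, which is the "small multiple of N" behind the linear complexity quoted above.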

III.  Applications 

The importance of these algorithms is further marked by the
wealth of physical phenomena and applications whose models
carry a computational complexity which could
otherwise be prohibitive. These algorithms have been very
successfully applied to the problem of "optical flow" estimation
from image sequences, where the smoothness penalty
corresponded to a prior fractal model. In addition, a
performance similar to that of exact MRF likelihood calculation
has also resulted even in problems where nonstationary
phenomena were present. Our latest applications of these methods
involved very high dimensional oceanography problems,
where the efficient processing of sparse altimetry data from the
Topex/Poseidon satellite resulted in maps of sea level
variations along with error statistics.

Finally, an area in which we believe our methods should
be particularly well matched is that of image reconstruction
and inverse problems in which blurred, integrated, or indirect
measurements of a random field are to be used in order to
estimate the field or to perform other tasks such as texture
discrimination, anomaly detection, etc. In particular we apply
the multiresolution modeling methods developed earlier to
the problems of modeling the statistical variability of synthetic
aperture radar (SAR) imagery and then using these models
for the discrimination of targets from clutter. We demonstrate
that statistical fluctuations are well captured by models of the
type that we have described, with significant differences
between the models for clutter and for targets, both in the model
parameters and in the statistics of the scale-to-scale detail
process $w(t)$ (which is Gaussian for targets and log-Rayleigh for
clutter).

Abstract — Approximation and estimation bounds
were obtained by Barron (1992, 1993 and 1994) for
function estimation by single hidden-layer neural nets.
This paper will highlight the extension of his results
to the two hidden-layer case. The bounds derived for
the two hidden-layer case depend on the number of
nodes $T_1$ and $T_2$ in each hidden layer, and also on the
sample size $N$. It will be seen from our bounds that in
some cases, an exponentially large number of nodes,
and hence parameters, is not required.

I. Introduction

A single hidden-layer feedforward sigmoidal network is a
family of functions $f_T(x)$ of the form

$$ f_T(x,\theta) = \sum_{i=1}^{T} c_i\,\phi(a_i \cdot x - b_i), \quad x \in R^d, $$

parametrized by $\theta = (a_i, b_i, c_i)_{i=1}^T$ with internal weight
vectors $a_i$ in $R^d$, internal location parameters $b_i$ in $R$,
external weights $c_i$, and $\phi$ any sigmoidal function with
distinct finite limits at $+\infty$ and $-\infty$. Such a network has
$d$ inputs, $T$ hidden nodes and a linear output unit. It
implements the ridge function $\phi(a_i \cdot x - b_i)$ on the nodes
in the hidden layer. The network model can be used to
approximate target functions $f(x)$ defined over bounded
subsets of $R^d$ and to estimate the function based on data
$(X_i, Y_i)_{i=1}^N$, a random sample from a joint probability
distribution $P_{X,Y}$ with $f(x) = E[Y_i | X_i = x]$.

This presentation will be concerned with extensions of
approximation and estimation bounds to two hidden-layer
sigmoidal networks. Such a network takes the form

$$ f_{T_1,T_2}(x,\theta) = \sum_{i=1}^{T_1} c_i\,\phi\Bigl(\sum_{j=1}^{T_2} a_{ji}\,\phi(\omega_{ji} \cdot x + b_{ji}) - d_i\Bigr), \quad x \in R^d. $$

There are $T_1$ nodes in the outer layer and $T_2$ nodes in
the inner layer, giving a total of $T_1 + T_1T_2$ nodes. It is
parametrized by $\theta = (c_i, d_i, a_{ji}, \omega_{ji}, b_{ji})$,
$i = 1, \ldots, T_1$, $j = 1, \ldots, T_2$.

II.  Approximation  Bounds 
Approximation bounds for single hidden-layer neural nets
were obtained by Barron (1992 and 1993). This
paper highlights the extension to the two hidden-layer
case. We show that a family of two hidden-layer
neural nets can approximate some classes of functions that
are not known to be approximable by single hidden-layer
neural nets. Barron's (1992) $L_2$ approximation bound
was $O(C_{f,hs}/\sqrt{T})$, where $T$ is the number of nodes and
$C_{f,hs}$ is the variation of the target function with respect
to the half-spaces. We show that two hidden-layer nets
can accurately approximate functions that have bounded
variation with respect to larger classes of sets. For example,
if the target function $f$ has bounded variation with
respect to a class of ellipsoids, then the $L_2$ approximation
error is

$$\| f - f_{T_1,T_2} \|_2 \le \ldots \qquad (1)$$

where $V$ depends on the variation property of the target
function and $K$ depends on the curvature of the ellipsoids,
when such a function is approximated by a two hidden-layer
neural net with $T_1$ nodes in the outer layer and $T_2$ nodes in
the inner layer. The indicator of a ball is an example of a
function that apparently cannot be approximated accurately
by a single hidden-layer net (with a linear output
unit) but is approximated well with two layers.

III.  Estimation  Bounds 

In deriving the estimation bound, the target function is
assumed to be estimated from the data $(X_i, Y_i)_{i=1}^{N}$, a random
sample of size $N$ from a joint probability distribution
$P_{X,Y}$ with $f(x) = E[Y_i \mid X_i = x]$. Barron's (1994) result
for the single hidden-layer case was $O(C_f^2/T) + O((Td/N) \log N)$,
where $d$ is the dimension of the input, $N$ is the sample
size and $T$ is the number of nodes. In our extension to the
two hidden-layer case, the overall mean squared estimation
error, in terms of the best approximation error, the
dimension $m_{T_1,T_2}$ of the parameter space and the sample
size $N$, is bounded by

$$O(\| f - f_{T_1,T_2} \|_2^2) + O(m_{T_1,T_2} \log N / N). \qquad (2)$$

In (2), the first term is obtained from (1). It can be seen
from our bounds that in some cases an exponentially
large number of nodes, and hence parameters, is not required.
Complexity regularization, and a calculation of
an index of resolvability, as in Barron (1994), are used in
the derivation of our estimation bound.
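The trade-off between the two terms of Barron's (1994) single hidden-layer rate, $O(C^2/T) + O((Td/N) \log N)$, can be illustrated numerically. The sketch below is only an illustration under stated assumptions: all constants hidden in the $O(\cdot)$ terms are set to 1, and the names `barron_bound` and `best_T` are hypothetical.

```python
import math


def barron_bound(T, d, N, C=1.0):
    """Sum of the two rate terms with every hidden constant set to 1 (an assumption)."""
    approximation = C ** 2 / T              # shrinks as nodes are added
    estimation = (T * d / N) * math.log(N)  # grows with the number of parameters
    return approximation + estimation


def best_T(d, N, C=1.0, T_max=10_000):
    """Integer number of nodes in [1, T_max] minimizing the illustrative bound."""
    return min(range(1, T_max + 1), key=lambda T: barron_bound(T, d, N, C))
```

Balancing the two terms analytically gives $T \approx C \sqrt{N / (d \log N)}$, so under these assumptions the number of nodes grows only like the square root of the sample size.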
Abstract - For a loss of lock indicator using an
up/down  counter,  the  probabilities  of  true  and  false 
declarations  of  in-lock  and  out-of-lock  are  calculated. 

I.  Introduction 

In  many  data  communication  receivers  up/down  counters  are 
used  as  a  critical  part  of  the  processing  to  determine  whether 
the  symbol  timing  and/or  carrier  phase  tracking  phase-locked 
loops  are  in-lock  or  out-of-lock,  and  it  is  necessary  to  calculate 
the  various  probabilities  for  true  and  false  indications  of  in-lock 
or  out-of-lock.  A  random  walk  along  a  line  (which  is  viewed  as 
a  Markov  chain)  is  an  exact  model  of  an  up/down  counter.  The 
random walk has N states, and in this application one end is a
partially  reflecting  barrier,  and  the  other  end  is  an  absorbing 
barrier  or  sink.  Previously  published  analyses  have  focused  on 
finding  the  average  time  to  make  a  declaration  and  its  variance. 
In  this  paper  we  concentrate  on  finding  the  probabilities  of 
making  a  true  or  a  false  declaration  within  a  certain  number  of 
symbol  intervals  or  within  a  certain  length  of  time. 

II. Calculation of Probabilities

Two different approaches are required to calculate the desired
probabilities. The first is through the transfer function of the
equivalent signal flow graph of the random walk [1], and the
second is by means of the diagonal form of the tridiagonal state
transition matrix [2], which has been found to have distinct
eigenvalues. Since the random walk has a finite number of
states, the transfer function is, of course, rational. In many
published results of this type, the expression for the transfer
function has a removable singularity, but here we give explicit,
general expressions for the numerator and denominator
polynomials (without common roots) of the transfer functions.
Since the numerical factors in all the polynomial coefficients
are integers, they can be readily and exactly calculated. For a
general random walk along a line, the denominator of the
transfer function satisfies a second-order difference equation
whose solution is the general polynomial mentioned above.

The probabilities of interest are given by the coefficients in the
power series expansion of the transfer function about Z = 0.
The coefficient of the m-th power of Z is the probability of being
at the chosen position in exactly m steps. It is also given by a
certain element in the m-th power of the state transition matrix
of the Markov chain. Often one is interested in the cumulative
probability, that is, the probability of being at a certain position or
state in any number of steps less than or equal to m. This is
given by the coefficient of the m-th power of Z in the power
series expansion of the transfer function divided by 1 - Z.
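The relation between exact-step and cumulative probabilities can be checked on a small example. The sketch below does not use the transfer-function expressions of the paper; it simply propagates the state distribution of a toy walk in exact rational arithmetic. The barrier convention (a down-move at state 0 leaves the walk at state 0) and the names `first_passage_probs` and `cumulative_probs` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import accumulate


def first_passage_probs(n_states, p_up, start, max_steps):
    """P(first reaching the absorbing end at exactly step m), for m = 0..max_steps.

    States are 0..n_states-1; state n_states-1 is absorbing and state 0 is a
    partially reflecting barrier (assumed convention: a down-move stays at 0).
    """
    q_down = 1 - p_up
    absorbing = n_states - 1
    dist = [Fraction(0)] * n_states
    dist[start] = Fraction(1)
    probs = []
    for _ in range(max_steps + 1):
        probs.append(dist[absorbing])  # mass that arrived exactly at this step
        new = [Fraction(0)] * n_states
        for s in range(absorbing):     # only transient states keep moving
            new[s + 1] += dist[s] * p_up
            new[max(s - 1, 0)] += dist[s] * q_down
        dist = new
    return probs


def cumulative_probs(probs):
    """Partial sums, i.e. the coefficients of the series divided by 1 - Z."""
    return list(accumulate(probs))
```

Because the state transition probabilities are entered as fractions, every result is an exact fraction, at the cost of numerators and denominators that grow with the number of steps.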

III.  Numerical  Calculations 

Calculation  of  these  probabilities  is  a  difficult  numerical 
problem  when  the  number  of  states  in  the  random  walk  is 
greater  than  10  or  so  and/or  the  number  of  steps  is  in  the 
hundreds.  The  difficulty  is  compounded  when  the  number  of 
steps is in the hundreds of thousands or millions, and there are
practical  situations  where  this  is  required.  For  the  number  of 
states  up  to  100  and  the  number  of  steps  up  to  500,  it  has  been 
found  that  the  power  series  expansion  capability  of 
Mathematica  does  an  excellent  job  in  calculating  the 
probabilities,  which  are  produced  as  exact  fractions  when  the 
state  transition  probabilities  are  read  in  as  fractions.  For 
situations  requiring  hundreds  of  thousands  of  steps,  the 
eigenvalue  expansion  or  diagonal  form  of  the  state  transition 
matrix  has  been  used  with  some  success  to  compute  powers  of 
this  matrix.  However,  with  the  double  precision  subroutines 
available  for  making  this  expansion,  the  generated  orthogonal 
eigenvector  matrix  is  often  so  close  to  being  singular  that  its 
required  inverse  cannot  be  calculated  reliably;  thus  this 
approach  breaks  down.  At  this  time  it  is  unknown  whether  the 
singular  nature  is  caused  by  numerical  imprecision  or  whether 
it  is  inherent  in  the  problem  for  some  values  of  state  transition 
probability.  However,  the  former  is  suspected.  Several 
numerical  examples  are  given. 
Abstract - An efficient iterative MMSE algorithm
that estimates the parameters of exponentially
damped sinusoids embedded in Gaussian noise is proposed.

I.  Introduction 

In  many  engineering  and  scientific  problems  the  observed 
measurements  are  modeled  as  exponentially  damped  sinusoids 
distorted  by  additive  noise.  A  difficult  but  interesting  prob¬ 
lem  has  always  been  the  estimation  of  the  nonlinear  parame¬ 
ters  of  these  signals,  the  frequencies  and  the  damping  factors. 
There have been a variety of approaches for estimation, most
of them revolving around the maximum likelihood (ML) principle.
Here we propose a method that yields the minimum
mean square error (MMSE) estimates of the frequencies and damping
factors, with all the remaining parameters of the model
being considered nuisance parameters.

II.  Problem  Statement 

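We assume that an $N \times 1$ data vector $\mathbf{y}$ represents $m$ exponentially
damped sinusoids embedded in white Gaussian noise.
In particular, $\mathbf{y}$ is given by $\mathbf{y} = \mathbf{H}\mathbf{a} + \mathbf{w}$, where $\mathbf{H}$ is an $N \times 2m$
matrix whose columns span the signal space, $\mathbf{a}$ is a vector of
amplitudes, and $\mathbf{w}$ is a noise vector with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. The
matrix $\mathbf{H}$ is defined by

$$\mathbf{H} = [\mathbf{s}_{1c} \ \mathbf{s}_{1s} \ \mathbf{s}_{2c} \ \mathbf{s}_{2s} \ \cdots \ \mathbf{s}_{mc} \ \mathbf{s}_{ms}]$$

$$\mathbf{s}_{kc}^T = [1 \ \ e^{-\alpha_k} \cos(2\pi f_k) \ \ \cdots \ \ e^{-\alpha_k (N-1)} \cos(2\pi f_k (N-1))]$$

$$\mathbf{s}_{ks}^T = [0 \ \ e^{-\alpha_k} \sin(2\pi f_k) \ \ \cdots \ \ e^{-\alpha_k (N-1)} \sin(2\pi f_k (N-1))].$$

All the signal parameters are unknown, as is the noise variance
$\sigma^2$. Given the observations $\mathbf{y}$, the objective is to estimate the
nonlinear parameters $f_k$ and $\alpha_k$, $k = 1, 2, \ldots, m$, of the signals.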
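The signal model can be made concrete with a short sketch that assembles the matrix of damped cosine and sine columns and forms the noiseless part of the data vector. The function names and the list-of-rows representation are assumptions for illustration; the sketch covers the model only, not the estimator.

```python
import math


def damped_sinusoid_columns(f, alpha, n_samples):
    """Cosine and sine columns for one damped sinusoid with frequency f and damping alpha."""
    s_c = [math.exp(-alpha * n) * math.cos(2 * math.pi * f * n) for n in range(n_samples)]
    s_s = [math.exp(-alpha * n) * math.sin(2 * math.pi * f * n) for n in range(n_samples)]
    return s_c, s_s


def build_H(freqs, alphas, n_samples):
    """Signal matrix as a list of rows; columns ordered cos_1, sin_1, ..., cos_m, sin_m."""
    columns = []
    for f, alpha in zip(freqs, alphas):
        columns.extend(damped_sinusoid_columns(f, alpha, n_samples))
    return [list(row) for row in zip(*columns)]


def noiseless_signal(H, amplitudes):
    """Matrix-vector product H a, i.e. the data vector without the noise term."""
    return [sum(h * a for h, a in zip(row, amplitudes)) for row in H]
```

For example, `build_H([0.16, 0.26], [0.2, 0.1], 25)` gives a 25 x 4 matrix for two sinusoids; the record length 25 is an arbitrary choice here.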

III.  MMSE  Estimator 

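Let the unknown frequencies and damping factors be denoted
by $\mathbf{f}$ and $\boldsymbol{\alpha}$, respectively. The MMSE estimates are
given by

$$\hat{\mathbf{f}} = \int_{\mathbf{f},\boldsymbol{\alpha}} \mathbf{f} \, p(\mathbf{f}, \boldsymbol{\alpha} \mid \mathbf{y}) \, d\boldsymbol{\alpha} \, d\mathbf{f}, \qquad \hat{\boldsymbol{\alpha}} = \int_{\boldsymbol{\alpha},\mathbf{f}} \boldsymbol{\alpha} \, p(\mathbf{f}, \boldsymbol{\alpha} \mid \mathbf{y}) \, d\mathbf{f} \, d\boldsymbol{\alpha} \qquad (1)$$

where $p(\mathbf{f}, \boldsymbol{\alpha} \mid \mathbf{y})$ is the a posteriori probability density function
of the frequencies and damping factors. Note that the
amplitudes and the noise variance $\sigma^2$ have been integrated
out analytically. The integrals (1) are $2m$-dimensional and,
as such, would require reliance on numerical techniques for
high-dimensional integration. An alternative is to resort to an
iterative approach similar in philosophy to the expectation-maximization
[1] and alternating projections [2] algorithms. To be more
specific, let $i$ denote the current iteration, and $f_j^{(i)}$, $\alpha_j^{(i)}$ the
current estimates of $f_j$ and $\alpha_j$, $j = 1, 2, \ldots, m$. Then, if we
approximate the a posteriori density $p(\mathbf{f}, \boldsymbol{\alpha} \mid \mathbf{y})$ by

$$p(\mathbf{f}, \boldsymbol{\alpha} \mid \mathbf{y}) \approx p(f_k, \alpha_k \mid \mathbf{y}, \mathbf{f}_{(k)}^{(i)}, \boldsymbol{\alpha}_{(k)}^{(i)}) \prod_{j=1, j \ne k}^{m} \delta(f_j - f_j^{(i)}) \, \delta(\alpha_j - \alpha_j^{(i)}),$$

our $2m$-dimensional integrals reduce to 2-dimensional
integrals. Here $\mathbf{f}_{(k)}^{(i)}$ and $\boldsymbol{\alpha}_{(k)}^{(i)}$ denote the estimates at the $i$-th
iteration of all the frequencies and damping factors except
those of the $k$-th signal. For example, the 2-dimensional
integrals for the frequencies have the form

$$f_k^{(i+1)} = \int_{f_k, \alpha_k} f_k \, p(f_k, \alpha_k \mid \mathbf{y}, \mathbf{f}_{(k)}^{(i)}, \boldsymbol{\alpha}_{(k)}^{(i)}) \, d\alpha_k \, df_k. \qquad (2)$$

The integrals for the damping factors are similar to (2). The
method is based on solving integrals such as (2) until convergence
of the estimates is achieved.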

IV.  Simulation  Result 

In the computer experiment we generated two damped sinusoids
in noise. The amplitude vector was equal to $\mathbf{a}^T =
[1 \ 0 \ 1 \ 0]$, the normalized frequencies were $f_1 = 0.16$ and
$f_2 = 0.26$, and the damping factors $\alpha_1 = 0.2$ and $\alpha_2 = 0.1$.
The SNR was varied between 5 and 20 dB in steps of 1 dB.
For each SNR we simulated 100 realizations. The two-dimensional
integration was carried out by an adaptive importance
sampling technique from [3]. The results for $f_2$ and $\alpha_2$ are
shown in Figures 1 and 2, respectively. Similar results were
obtained for $f_1$ and $\alpha_1$. In each figure, the solid line represents
the Cramér-Rao (CR) bounds, and the other line the mean
squared error of our estimates.
Abstract - Edge location in compound Gauss-Markov
random fields is formulated as a parameter estimation
problem; since the number of parameters is unknown,
a minimum-description-length (MDL) criterion
is proposed.

I.  Introduction 

Compound  Gauss-Markov  random  field  (CGMRF) 
models  allow  for  edge-preserving  Bayesian  image  restora¬ 
tion/reconstruction  using  continuous  (Gaussian)  statistical 
models  together  with  a  binary  (hidden)  edge  field  [1].  The 
CGMRF  approach  to  simultaneous  edge  detection  and  im¬ 
age  restoration  involves  two  random  fields:  one  (intensity 
field)  representing  the  image  to  be  restored  and  another  one 
signaling  edge  elements.  To  perform  joint  maximum  a  pos¬ 
teriori  (MAP)  estimation  of  both  the  image  and  its  edges, 
some  prior  model  has  to  be  specified  for  the  edge  process. 
This  prior  is  usually  not  explicitly  stated;  instead,  a  joint 
intensity-edge  prior  is  directly  considered  [1],  [2],  [3]. 

Our  approach  does  without  the  specification  of  any  prior 
for  the  line  process  by  adopting  a  new  perspective:  we  inter¬ 
pret  edge  locations  as  (deterministic  but  unknown)  param¬ 
eters  of  the  original  image  prior  model.  Locating  edges  is 
then  a  parameter  estimation  problem  with  a  salient  feature: 
unknown  number  of  parameters  (edges).  This  fact  places  the 
problem  in  a  class  to  which  Rissanen’s  minimum  description 
length  (MDL)  principle  has  been  successfully  applied  [4]. 

We  propose  an  MDL-type  edge  location  criterion  for 
image  restoration  based  on  a  CGMRF  model;  it  contains 
no  edge-related  parameters,  such  as  detection  penalty,  which 
appear  (and  have  to  be  specified)  in  other  types  of  models. 

II.  The  MAP  Estimate  and  the  CGMRF  Model 

Let $x$ be a noncausal CGMRF, modelling the original
image to be estimated, and $y$ a linear observation (LO) of
$x$, contaminated by additive white Gaussian noise (AWGN).
Let $l$ be the (hidden) binary edge field (line process); its
elements, placed on an interpixel dual grid, indicate whether
bonds between elements of $x$ are broken or not. What is
usually sought is the joint MAP estimate of $x$ and $l$, given
$y$, which is the mode of $p(x, l \mid y)$ or of $p(x, y \mid l) \, p(l)$. Here,
$p(y, x \mid l)$ is the joint PDF of $x$ and $y$ given a certain edge
configuration, and $p(l)$ the prior of the line process. Notice
that $l$ can be seen as a parameter of $p(x \mid l)$ or of $p(x, y \mid l)$;
under the CGMRF-LO-AWGN assumptions, these are both
Gaussian probability density functions which depend on $l$.

III.  The  Minimum  Description  Length  Principle 

The MDL principle generalizes the maximum likelihood
(ML) criterion to cases where a parameter vector $\theta$ of unknown
dimension $k$ is to be estimated [4]. The (joint) MDL
estimate of $k$ and $\theta$, given observed data $z$, is

$$(\hat{k}, \hat{\theta}) = \arg\min_{k,\theta} \{ -\log_2 p(z \mid \theta, k) + L(\theta \mid k) + L(k) \}, \qquad (1)$$

where $L(\theta \mid k)$ and $L(k)$ are, respectively, the code lengths for
$\theta$ (given that it is $k$-dimensional) and for $k$ itself; for further
details see [4] and the references therein. As usual, $L(k)$ is
here considered constant and dropped.

IV.  Proposed  Approach 

To  abandon  any  prior  assumption  (expressed  in  p(l)) 
about  the  edge  field,  we  interpret  edge  locations  as  priorless 
parameters  of  p(x|l). 

Let the locations of all (say $k$) signaled edges be collected
in a $k$-dimensional parameter vector $\theta$. Writing $p(x \mid l)$
is equivalent to writing $p(x \mid \theta, k)$, since $\theta$ is just a compact code
for $l$. In a first order model [1], [2], [3], and taking $M \times N$
size images, we need $\log_2(MN)$ bits to code each edge location
plus 1 bit to distinguish horizontal from vertical edges.
Accordingly, $L(\theta \mid k) = k \log_2(2MN)$.
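The edge code length and its role in the MDL criterion (1), with $L(k)$ dropped, can be sketched as follows. The function names are hypothetical, and the negative log-likelihood values are assumed to be supplied by the caller; computing them for a CGMRF is outside the scope of this sketch.

```python
import math


def edge_code_length_bits(k, M, N):
    """Bits needed to code k edge locations on an M x N image: k * log2(2 M N)."""
    return k * math.log2(2 * M * N)


def mdl_score(neg_log2_likelihood, k, M, N):
    """Description length in bits: data term plus parameter code length (smaller is better)."""
    return neg_log2_likelihood + edge_code_length_bits(k, M, N)


def pick_edge_count(candidates, M, N):
    """candidates: (k, neg_log2_likelihood) pairs; returns the k with the shortest description."""
    return min(candidates, key=lambda c: mdl_score(c[1], c[0], M, N))[0]
```

On a 256 x 256 image each signaled edge costs 17 bits, so an extra edge is accepted only if it shortens the data term by more than that.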

In the presence of both $x$ and $y$, the MDL estimates of $k$
and $\theta$ could be obtained by considering $z = (x, y)$ and inserting
$p(x, y \mid \theta, k) = p(y \mid x) \, p(x \mid \theta, k)$ (notice that $p(y \mid x, \theta, k) =
p(y \mid x)$) plus the parameter code length $L(\theta \mid k)$ into the MDL
criterion (1). This yields an MDL estimation criterion for $k$
and $\theta$ (i.e. the number of edges and their locations):

$$(\hat{k}, \hat{\theta}) = \arg\min_{k,\theta} \{ k \log_2(2MN) - \log_2 p(x, y \mid \theta, k) \}. \qquad (2)$$

The  criterion  specified  by  (2)  has  an  intrinsic  difficulty 
lying  in  the  fact  that  x  is  not  observed  (is  missing);  i.e.  it 
can  be  classified  as  MDL  parameter  estimation  from  incom¬ 
plete  data.  To  deal  with  (2),  we  have  developed  a  modified 
version  of  the  expectation-maximization  (EM)  algorithm  [5]; 
further details are presented in [6]. Although it is a suboptimal
scheme, the results obtained show the ability of the
proposed criterion to adapt to the image edge structure [6].
Abstract - A maximum entropy formulation leads to a neural
network  which  is  factorable  in  both  form  and  function  into 
individual  neurons  corresponding  to  the  Hopfield  neural  model. 
A  maximized  mutual  information  criterion  dictates  the  optimal 
learning  methodology  using  locally  available  information. 

I. INTRODUCTION

A biological model is developed here in which neurons
computationally realize a multi-dimensional hypothesis testing
function implemented on a neural field $Q = \{q_1 \cdots q_M\}$ of
propositions $y \in B^M$, $B = \{0,1\}$, $M \in Z$, which become defined
through learning. Each neural output $y_i$ describes a conjunctive
component of a compound proposition $y = y_1 \cdot y_2 \cdot y_3 \cdots y_M$ posed
to observed input originating from arbitrary sources. Answers
represent decisions which are in turn provided as individual
neural output indications (action potentials). Learning within the
neural field is realized by the definition of the elemental
propositions, which in turn correspond to those propositions
which serve to maximize the channel capacity between input
ensembles and the ensemble of recalled states. The operational
objective of the field corresponds to the search for global
minima of a quadratic energy function $E_Q = \sum_i e_i$ which
parameterizes the a posteriori Gibbs distribution as conditioned
on the input vector cue. The negatives of the respective neuron
energies, $-e_i$, correspond to the statistical evidence used in
determining the probability of generating an action potential.

II. MAXIMUM ENTROPY (ME) FORMULATION

It  is  assumed  that  the  neural  field  Q  is  capable  of 
extracting  information  from  external  inputs  and  its  own 
outputs  y  through  a  set  of  sampling  functions  F.  The  field  Q 
uses  the  sampled  data  to  estimate  moments  on  the  defined 
sampling  functions  which  it  in  turn  uses  to  form  the  joint  ME 
distribution  P(x,y).  As  such,  this  probability  is  a  property  of 
the  observing  ensemble  Q  as  it  should  be  since  probability  is 
deemed  to  be  a  property  of  the  observer  [1].  The  computed 
moments  serve  to  realize  P(x,y)  as  a  unique  network  ME 
distribution  or  equivalently  a  Gibbs  distribution  parameterized 
by  synaptic  connection  weights  which  are  in  fact  the  Lagrange 
multipliers  for  the  ME  distribution. 
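How a Gibbs distribution with a quadratic energy yields a sigmoidal firing probability can be sketched in a few lines. This is a generic Hopfield/Boltzmann-style illustration, not the derivation of the paper: the energy form $E(y) = -\frac{1}{2} \sum_{i,j} w_{ij} y_i y_j - \sum_i b_i y_i$ with symmetric weights and zero diagonal, and the names `firing_probability` and `gibbs_sweep`, are assumptions.

```python
import math
import random


def firing_probability(weights, biases, y, i):
    """P(y_i = 1 | all other outputs) under the assumed quadratic-energy Gibbs distribution."""
    # Evidence for unit i: weighted sum of the other outputs plus its bias
    evidence = biases[i] + sum(
        w * yj for j, (w, yj) in enumerate(zip(weights[i], y)) if j != i
    )
    return 1.0 / (1.0 + math.exp(-evidence))  # sigmoidal transfer characteristic


def gibbs_sweep(weights, biases, y, rng):
    """One in-place Gibbs-sampler pass over all units; rng is a random.Random instance."""
    for i in range(len(y)):
        y[i] = 1 if rng.random() < firing_probability(weights, biases, y, i) else 0
    return y
```

In this picture the synaptic weights play the role of the Lagrange multipliers of the ME distribution, and repeated sweeps draw samples of the binary output vector.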

III.  MAXIMIZED  MUTUAL  INFORMATION  (MMI) 

The Gibbs Mutual Information Theorem as derived in
[2] is applied to the composite network distribution, which then
serves to constrain the network architecture and signal
processing required to approximate the MMI criterion between
the input ensemble x and the output ensemble y.

The use of an MMI criterion serves to optimize the
storage capacity of the network which is given by the entropy
H(y)  of  the  network  storage  capacity.  H(y)  has  a  theoretical 
maximum  of  M  bits  which  can  asymptotically  be  obtained  using 
the  MMI  criterion.  This  value  can  change  once  dynamical 
considerations  are  included.  Degradation  of  the  storage 
capacity  due  to  input  noise  is  not  considered  here,  but  has  been 
considered  elsewhere.  Degraded  decipherability  of  the  input 
code  by  an  observer  of  the  output  neural  code  trying  to  guess 
the  input  code  can  be  attributed  to  either  noise,  the  many-to-one 
compression  imposed  by  the  condition  that  N>M,  or  both. 

IV.  RESULTS 

Together  ME  and  MMI  lead  to  a  neural  field  which  is 
factorable  in  both  form  and  function  into  component 
computational  entities  which  correspond  to  the  Hopfield  neuron 
model  ([3])  including  decision  threshold,  action  potential 
realization,  Hebbian  learning,  sigmoidal  transfer  characteristic, 
and conditionalized principal component analysis using a simple
modification  of  an  equation  originally  described  by  Oja  [4]. 

A  diffusion-based  search  scheme  using  Langevin’s 
equation  in  conjunction  with  the  neural  field  energy  Eq  is  shown 
to  lead  to  the  FitzHugh-Nagumo  neuron  activation  model. 
Synchronous  activation  patterns  observed  in  biological 
assemblies  of  neurons  can  then  be  described  as  an  asymptotic 
periodic  Markov  chain  realized  through  a  Gibbs-sampler 
computational  paradigm.  Quantitative  details  of  this  process  are 
alluded  to. 
Abstract - We provide a characterization of Gauss
Markov random fields in terms of partial differential
equations with a random forcing term. Our method
consists  of  obtaining  a  concrete  representation  of  an 
abstract  stochastic  partial  differential  equation  using 
some  results  from  the  theory  of  vector  measures. 

I.  Preliminary  Background 
To fix notation, let $Y_u$, $u \in K$ ($K$ a compact subset of $R^n$), be
a random field and let $D$ be an open subset of $K$. Let $\Gamma$ be
the boundary of $D$. Then it is well known that the field $Y_u$ is
Markov with respect to $D$ if

$$E[Y_u Y_v \mid \sigma(\Gamma)] = E[Y_u \mid \sigma(\Gamma)] \, E[Y_v \mid \sigma(\Gamma)] \qquad (1)$$

where $u \in D$ and $v \in D^c$, and $\sigma(\Gamma)$ is the usual germ-field
given by

$$\sigma(\Gamma) = \bigcap \{ \sigma(O) ; \ O \text{ open and } O \supset \Gamma \}. \qquad (2)$$

Thus $D$ and $D^c$ are conditionally independent given knowledge
of the boundary.

For Gaussian fields, conditioning on $\sigma(A)$ ($A \subset K$) is
projection onto the closed subspace generated by $Y_u$, $u \in A$,
instead of the larger subspace of all functions measurable
with respect to $\sigma(A)$ (see [2]). Hence, the Markov property can
be formulated in terms of projection on these smaller Hilbert
spaces. We introduce the spaces

$$H(K) = \text{closed subspace generated by } Y_u, \ u \in K \qquad (3)$$

and the corresponding reproducing kernel Hilbert space $\mathcal{H}(K)$.
It is well known that $H(K)$ and $\mathcal{H}(K)$ are isometrically isomorphic
through the mapping $J : H(K) \to \mathcal{H}(K)$,

$$JY(t) = E[Y Y_t], \quad t \in K. \qquad (4)$$

II.  Sample  Path  Characterization 

We assume that $C_0^\infty(K)$ is dense in $\mathcal{H}(K)$. For $u, v \in
C_0^\infty(K)$, we can write the inner product of $\mathcal{H}(K)$ in the form

$$\langle u, v \rangle_{\mathcal{H}(K)} = (Pu, v)_{L^2} \qquad (5)$$

where $P$ is a differential operator written in divergence
form (see [3]).

In order to derive a sample path characterization we use
the well known technique for associating a generalized random
field $\xi$ to our ordinary random field $Y$ through the following
formula

$$\xi(\phi) = \int_K Y(u, \omega) \, \phi(u) \, du. \qquad (6)$$

A generalized random field can be regarded as a linear operator
from $C_0^\infty(K)$ to a space of random variables. With
every generalized field, there is an associated dual field. The
dual field $\xi^*$ is also a linear operator from $C_0^\infty(K)$ to random
variables such that

$$E[\xi(u) \, \xi^*(v)] = \int_K u(t) \, v(t) \, dt. \qquad (7)$$

This work was partially supported by ONR Grant # N00014-91-J-1001.

Kallianpur and Mandrekar [5] have shown that an ordinary random
field is Markov if and only if the associated generalized
random field is Markov. This enables us to study the generalized
random field and then transfer its properties back to the
associated ordinary random field.

Now, the generalized field is Markov if the dual field is
local (see [1]), in the sense that if $\operatorname{supp} u \cap \operatorname{supp} v = \emptyset$, then
$E(\xi^*(u) \, \xi^*(v)) = 0$. Locality of the dual field implies that
$E[\xi^*(u) \, \xi^*(v)] = (Pu, v)_{L^2}$, where $P$ is the same differential
operator associated with the inner product of the RKHS of the
ordinary random field (see equation (5)).

We know [1] that $\xi$ satisfies the following abstract equation

$$\xi(Pu) = \xi^*(u). \qquad (8)$$

We show that the dual of the generalized field is intimately
related to $J^{-1}$ ($J$ is defined in equation (4)). Under some
integrability condition we further show that the mapping $J^{-1}$
is weakly compact. A weakly compact mapping from $L^1(\mu)$
to a Hilbert space is Riesz representable (see [4]). Therefore,
we can write equation (8) in the following weak form

$$\int Y(t, \omega) \, Pu(t) \, dt = \int \xi^*(t, \omega) \, u(t) \, dt \qquad (9)$$

where $u(t) \in C_0^\infty(K)$.

When the support of $u$ is in $D$, a subset of $K$, we relate
$\xi^*(u)$ to minimum mean square error. In particular, we show
that $\xi^*(u)$ lies in the closed subspace generated by $(Y_u - E[Y_u \mid
\sigma(\Gamma)])$, $u \in D$. This provides a canonical description which is
analogous to the one provided by Woods [6] in the context of
Gauss Markov random fields on lattices.
ABSTRACT

Discriminatory  power  is  the  relative  usefulness  of  a 
feature  for  classification.  Traditionally,  feature-selection 
techniques have defined discriminatory power in terms of
a particular classifier. Non-parametric discriminatory
power allows feature selection to be based on the
structure of the data rather than on the requirements of
any one classifier. In previous research, we have defined
a  metric  for  non-parametric  discriminatory  power  called 
relative  feature  importance  (RFI)  [1].  In  this  work,  we 
explore  the  construction  of  RFI  through  closed-form 
analysis  and  experimentation.  The  behavior  of  RFI  is 
also  compared  to  traditional  techniques. 

Relative  feature  importance  ranks  features  based  on  an 
estimate  of  their  relative  potential  for  class  separation.  A 
set  of  optimal,  orthogonal  features  is  extracted  from  each 
possible  feature  subset,  in  order  to  estimate  the  potential 
for  separation  contained  in  the  subset.  The  separation 
between  class-conditional  joint  feature  distributions  is 
measured  in  the  transformed  space.  The  contribution  of 
each  original  feature  to  the  separation  in  the  transformed 
space  is  estimated.  Note  that  the  separation  contributed  is 
relative  in  the  sense  that  the  use  of  the  other  features  in 
the  subset  is  taken  into  account. 

Because  the  features  may  not  be  independent,  the 
method  first  must  determine  the  optimal  subset  of 
features.  The  optimal  subset  of  original  features  is  the 
smallest  subset  that  yields  maximal  separation  in  the 
transformed  space.  Features  outside  the  optimal  subset  are 
assigned  an  RFI  of  zero.  The  features  within  the  optimal 
subset  are  ordered  by  their  estimated  contribution.  The 
rank  of  a  feature  is  its  RFI. 

Some  critical  design  choices  for  RFI  are:  the  feature 
extraction  technique,  the  measure  of  separation  in  the 
transformed  space,  and  the  technique  used  to  estimate  the 
contribution  of  the  original  features  to  separation  in  the 
transformed  space. 

Rather  than  calculating  RFI  based  on  separation 
between  the  class-conditional  joint  feature  distributions  of 
the  original  features,  the  method  uses  a  non-parametric 
feature  extraction  technique  and  calculates  separation  in  the 
transformed  space.  Our  extraction  technique  is  based  on 
Fukunaga  and  Mantock’s  non-parametric  discriminant 
analysis [2]. Briefly, the data is expanded in the
eigenvectors of the ratio of the within-class to the between-class
scatter matrices. RFI uses non-parametric variations of
within-class  and  between-class  scatter  matrices  (which  use  local 
k-nearest-neighbor  density  estimates)  and  differentially  weights 
samples  based  on  their  distance  from  the  discriminant  boundary. 
Since  many  traditional  feature-ranking  techniques  are  based  on 
the  marginal  distributions  of  the  original  features,  our  examples 
include  several  multi-cluster  experiments  which  cannot  be 
solved  using  the  marginals,  but  can  be  solved  using  extraction. 

The  contribution  of  the  original  features  to  separation  in  the 
transformed  space  is  estimated  using  the  Weighted  Absolute 
Weight  Size  (WAWS)  [1].  WAWS  combines  information  from 
the  magnitudes  of  the  eigenvectors  (which  measure  the 
contribution  of  the  original  features  to  the  extracted  features) 
and  the  normalized  eigenvalues  (which  measure  the  amount  of 
separation  in  the  transformed  space  contributed  by  each  extracted 
feature). 
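One plausible reading of this combination, offered only as an illustration, weights the absolute eigenvector entries by the normalized eigenvalues and sums over the extracted features. The exact WAWS formula is given in [1] and may differ; the names `waws` and `rank_features` and the input layout are assumptions.

```python
def waws(eigenvectors, eigenvalues):
    """Illustrative weighted absolute weight size for each original feature.

    eigenvectors[k][j]: weight of original feature j in extracted feature k.
    eigenvalues[k]: separation attributed to extracted feature k (non-negative).
    """
    total = sum(eigenvalues)
    normalized = [lam / total for lam in eigenvalues]  # eigenvalues scaled to sum to 1
    n_features = len(eigenvectors[0])
    return [
        sum(normalized[k] * abs(eigenvectors[k][j]) for k in range(len(eigenvalues)))
        for j in range(n_features)
    ]


def rank_features(scores):
    """Feature indices ordered from largest to smallest score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Under this reading, a feature scores highly when it has large weights in the extracted features that carry most of the separation.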

Through  closed-form  analysis  and  a  series  of  experiments,  we 
explore  several  design  choices  in  the  non-parametric  feature 
extraction  algorithm.  We  compare  the  algorithm’s  performance 
for  Euclidean  vs.  Mahalanobis  distance  calculations,  and  for 
parametric  vs.  non-parametric  scatter  matrices  for  both  within- 
class  and  between-class  scatter.  We  consider  several  algorithms 
for  calculating  the  non-parametric  scatter  matrices  and  for 
measuring  separation  in  the  transformed  space.  Each  variant  of 
the  algorithm  is  evaluated  according  to  several  criteria. 

A  metric  for  non-parametric  discriminatory  power  is 
important for a number of reasons. First, applications exist
where optimizing the performance of an artificial classifier is not
the  final  goal:  physicians  need  to  know  which  test  is  the  most 
accurate  predictor  of  a  particular  disease  whether  or  not  they 
wish  to  use  a  classifier  system  as  a  diagnostic  aid.  Second, 
non-parametric  discriminatory  power  can  be  used  to  direct 
feature  search,  without  first  having  to  select  a  classifier. 
Finally,  a  classifier,  when  desired,  can  be  selected  based  on  the 
distributional  structure  of  the  high-ranking  features.  All  of 
these  activities  are  currently  undertaken  in  an  ad-hoc  fashion  by 
humans  using  mapping  and  projection  techniques. 
Unfortunately, such techniques are of limited utility in
high-dimensional spaces.
Sci. Pub., in press.

[2]  K.  Fukunaga  and  J.  Mantock,  “Nonparametric  discriminant 
analysis,”  IEEE  Trans.  Pattern  Anal.  Machine  Intell., 
Abstract - We find a formula for the Shannon-to-Hartley
entropy ratio of a text governed by the
Zipf law. The formula is in good agreement with
real texts. It is asymptotically

$$\frac{1}{2} + \frac{\log \log |D|}{\log |D|},$$

$|D|$ being the size of the text's dictionary, $\lim |D| =
\infty$. It means that a word to variable length code might
significantly outperform a word to fixed length one
for large dictionaries only.


I. Theoretical Considerations

Let T be a text, D be the set of all words of T (dictionary), and
$p(i)$ be the frequency of the i-th word, $i = 1, \ldots, |D|$.
According to the Zipf law,

$p(i) = A/i, \quad A = \mathrm{const}, \quad i = 1, \ldots, |D|.$  (1)

Obviously,

$\sum_{i=1}^{|D|} p(i) = 1.$  (2)

The well-known formula for the harmonic sum and (2) yield

$\frac{1}{A} \approx \ln|D| + c_1,$  (3)

where $c_1 = 0.577\ldots$ is the Euler constant. Per word Hartley
entropy of T equals $\log|D|$. It is the cost of a per word fixed
length encoding of T (to within an additive constant). Per
word Shannon entropy of T equals

$H = -\sum_{i=1}^{|D|} p(i)\log p(i).$  (4)

It is the cost of a per word variable length encoding of T (to
within an additive constant). By the Euler-Maclaurin relation
between sums and integrals we obtain

$\sum_{i=1}^{|D|} \frac{\ln i}{i} = \frac{\ln^2|D|}{2} + c_2 + o(1),$  (5)

where $c_2 = 0.211\ldots$. From (1) and (3)-(5) we get for the ratio
of entropies

$\frac{H}{\log|D|} = \frac{1}{2} - \frac{c_1}{2\ln|D|} + \frac{\ln(\ln|D| + c_1)}{\ln|D|} + \frac{c_2 + c_1^2/2}{\ln^2|D|} + o\left(\frac{1}{\ln^2|D|}\right).$  (6)

If $\lim \log|D| = \infty$, then

$\frac{H}{\log|D|} \approx \frac{1}{2} + \frac{\ln\ln|D|}{\ln|D|}.$


II. Experiments

We have the following experimental results up to now.

1. Collected works of Russian poet A. S. Pouschkin. $|D|$ = 21197
words. Theoretical Ratio = 0.710. Real Ratio $H/\log|D|$ = 0.726.

2. The book of R. E. Krichevskii "Universal Compression and
Retrieval", Kluwer Publishers. $|D|$ = 1900 words. Theoretical
Ratio = [...]. Real Ratio $H/\log|D|$ = 0.80.

3. The book of Y. G. Reschetnjak "Space Mappings with
Bounded Distortion", Providence, R.I. $|D|$ = 2000 words.
Theoretical Ratio = [...]. Real Ratio $H/\log|D|$ = 0.77.

4. A paper from Siberian Mathematical Journal. $|D|$ = 2100
words. Theoretical Ratio = 0.78. Real Ratio $H/\log|D|$ = 0.79.

5. Another paper from Siberian Mathematical Journal. $|D|$
= 1600 words. Theoretical Ratio = 0.79. Real Ratio
$H/\log|D|$ = 0.76.
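The entropy ratio of an exactly Zipfian dictionary can be checked numerically. The following is a minimal sketch, not the author's procedure: it assumes the idealized law (1) holds exactly for every word, and the function names `zipf_entropy_ratio` and `asymptotic_ratio` are illustrative.

```python
import math

def zipf_entropy_ratio(d):
    """Exact H / log|D| for p(i) = A / i, i = 1..d (the ratio is base-independent)."""
    a = 1.0 / sum(1.0 / i for i in range(1, d + 1))               # normalisation (2)
    h = -sum((a / i) * math.log(a / i) for i in range(1, d + 1))  # entropy (4), in nats
    return h / math.log(d)

def asymptotic_ratio(d):
    """Leading-order approximation 1/2 + ln ln|D| / ln|D|."""
    return 0.5 + math.log(math.log(d)) / math.log(d)

# Dictionary sizes taken from the experiments reported in this paper.
for d in (1600, 2100, 21197):
    print(d, round(zipf_entropy_ratio(d), 3), round(asymptotic_ratio(d), 3))
```

The exact ratio decreases slowly toward 1/2 as the dictionary grows, which is the sense in which variable length word codes pay off mainly for large dictionaries.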
Abstract —

A length-n block code C of size $2^{nR}$ over a finite
alphabet $\mathcal{X}_0$ is used to encode a memoryless source
over a finite alphabet $\mathcal{X}$. A length-n source sequence
x is described by the index i of the codeword $x_0(i)$ that
is nearest to x according to the single-letter distortion
function $d_0(x, x_0)$. Based on the description i and the
knowledge of the codebook C, we wish to reconstruct
the source sequence so as to minimize the average
distortion defined by the distortion function $d_1(x, x_1)$,
where $d_1(x, x_1)$ is in general different from $d_0(x, x_0)$. In
fact, the reconstruction alphabets $\mathcal{X}_0$ and $\mathcal{X}_1$ could be
different.

We study the minimum, over all codebooks C, of
the average distortion between the reconstructed sequence
$x_1(i)$ and the source sequence x as the block-length
n tends to infinity. This limit is a function of
the code rate R, the source's probability law, and the
two distortion measures $d_0(x, x_0)$ and $d_1(x, x_1)$.

This problem is the rate-distortion dual of the problem
of determining the capacity of a memoryless channel
under a possibly suboptimal decoding rule.

The performance of a random i.i.d. codebook is
found, and it is shown that the performance of the
"average" codebook is in general suboptimal. The resulting
distortion can in general be improved by considering
i.i.d. codebooks of m-tuples. It is shown that
as m tends to infinity, the performance of the "average"
codebook becomes optimal.

By studying the special case of a Gaussian source
and minimum Euclidean distance description, i.e.,
$d_0(x, x_0) = (x - x_0)^2$, we obtain an improved upper
bound on the rate distortion function for a Gaussian
source and an arbitrary distortion measure.

By exploring the analogy between the rate distortion
problem and the mismatched channel decoding
problem, we find that for an i.i.d. real-valued source
of second moment $\sigma^2$ a random Gaussian codebook
of size $2^{nR}$ achieves, for sufficiently large n, an average
mean-square-error distortion of $\sigma^2 2^{-2R}$ irrespective of
the source distribution.
Abstract — The underlying mathematical problem
of both SNR estimation and blind equalization is the
sum of random processes. It can be shown that it is
sufficient to describe the random processes, as well as
their sum, by a shape factor of the p.d.f., the kurtosis,
which involves the second and fourth order moments.

I.  Introduction 

In  the  following  we  concentrate  on  discrete  time  signals  and 
systems,  although  the  algorithms  in  principle  are  applicable 
to  analog  systems  as  well. 

A discrete time random process is a sequence of identically
distributed random variables (r.v.) $x_k$. If the random process
is complex, each r.v. consists of a real and an imaginary part,
$x = x_r + j x_i$, with joint p.d.f. $f_x(x_r, x_i)$.

II.  Formulation  of  the  Problem 
The SNR estimation problem can be described as follows:
Given a wanted signal random process (r) with (joint) p.d.f.
$f_r(r_r, r_i)$ and a (statistically independent) noise random
process (n) with p.d.f. $f_n(n_r, n_i)$, estimate the signal-to-noise
ratio $\mathrm{SNR} = S_r/S_n$ just by observation of the sum process
(y) = (r) + (n). Usually the type of p.d.f. of the wanted signal
(e.g. M-PAM, M-PSK, etc.) and the noise (e.g. Gaussian)
are known, while in general neither received signal power nor
noise power is known.

Blind equalization means to find a filter

$(c) = c_0, c_1, \ldots, c_{N_c - 1}$

for an unknown system (channel) (h) such that the overall
impulse response $(s) = (c) \circ (h) = 1$, where "$\circ$" denotes
discrete convolution. "Blind" means that only the output
$(z) = (x) \circ (s)$ is observable, while the input sequence (x) is
unknown. Of course the knowledge of some statistical properties
of (x) is necessary. More specifically we demand that (x)
be an i.i.d. sequence with known kurtosis.

Hence the output random process is a weighted sum of a
number of i.i.d. input random processes $(x)_k$:

$(z) = \sum_k a_k (x)_k$  (1)

The  parallelism  to  the  SNR  estimation  problem  is  obvious. 

III.  The  Sum  of  Random  Processes  (Variables) 
AND  THE  Kurtosis 

We assume all input random processes to be complex valued,
stationary and zero-mean with existing characteristic
function and moments up to the fourth order. Moreover, the
channel input (x) must be i.i.d.

The $(k + \ell)$-th order moment of a complex random process
(r) is defined by

$m_r(k, \ell) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} r_r^k \, r_i^\ell \, f_r(r_r, r_i) \, dr_r \, dr_i.$  (2)

It can be shown that the $(k + \ell)$-th order moment of the
sum (y) = (r) + (n) can be expressed as

$m_y(k, \ell) = \sum_{u=0}^{k} \sum_{v=0}^{\ell} \binom{k}{u} \binom{\ell}{v} \, m_r(k - u, \ell - v) \, m_n(u, v).$  (3)

We define the kurtosis of a random process (r) as the ratio
of the fourth order to the squared second order expected value

$K_r = E\{r r r^* r^*\} / (E\{r r^*\})^2.$  (4)

Replacing expected values by moments gives:

$K_r = \frac{m_r(4, 0) + 2 m_r(2, 2) + m_r(0, 4)}{(m_r(2, 0) + m_r(0, 2))^2}.$  (5)

Using (2), (3) and (5) we can express the kurtosis $K_y$ of (y)
as:

$K_y - K_G = \kappa^2 (K_r - K_G) + (1 - \kappa)^2 (K_n - K_G),$  (6)

where $\kappa = S/(S + N)$ is the wanted signal power to total
received power ratio and $K_G$ is the kurtosis of a Gaussian
process. Eq. (6) is the motivation to define the Gauss unlikeness

$G_r = K_r - K_G$, cf. [1].

IV.  SNR  Estimation 

The algorithm to estimate the SNR is quite simple now. $G_r$
and $G_n$ are known from the used modulation scheme and the
expected noise type, resp. Then observe the received signal
(y) and compute an estimate for $G_y$ by averaging in the time
domain. Finally solve (6) for $\kappa$; the estimated SNR is
$\mathrm{SNR} = \kappa/(1 - \kappa)$.
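The estimation steps of Section IV can be sketched as follows. This is only an illustration under stated assumptions that are not taken from the paper: the wanted signal is unit-power QPSK (kurtosis 1), the noise is circular complex Gaussian (kurtosis 2, so its Gauss unlikeness is zero), and the helper names `kurtosis`, `estimate_snr` and `simulate_qpsk` are hypothetical.

```python
import math
import random

K_GAUSS = 2.0  # E|g|^4 / (E|g|^2)^2 for a circular complex Gaussian (assumed noise model)

def kurtosis(samples):
    """Time-average estimate of E|y|^4 / (E|y|^2)^2 for a zero-mean complex sequence."""
    n = len(samples)
    p2 = sum(abs(y) ** 2 for y in samples) / n
    p4 = sum(abs(y) ** 4 for y in samples) / n
    return p4 / (p2 * p2)

def estimate_snr(samples, k_signal):
    """Solve (6) for kappa assuming Gaussian noise (G_n = 0); return kappa / (1 - kappa)."""
    g_y = kurtosis(samples) - K_GAUSS
    g_r = k_signal - K_GAUSS                    # must be non-zero (non-Gaussian signal)
    ratio = min(max(g_y / g_r, 0.0), 0.999999)  # clamp so that kappa stays in [0, 1)
    kappa = math.sqrt(ratio)
    return kappa / (1.0 - kappa)

def simulate_qpsk(n, snr, rng):
    """Unit-power QPSK symbols plus circular complex Gaussian noise of power 1/snr."""
    s = math.sqrt(0.5)
    sd = math.sqrt(0.5 / snr)  # standard deviation of each noise component
    return [complex(rng.choice((-s, s)), rng.choice((-s, s)))
            + complex(rng.gauss(0.0, sd), rng.gauss(0.0, sd)) for _ in range(n)]

rng = random.Random(0)
y = simulate_qpsk(50000, snr=4.0, rng=rng)
print(estimate_snr(y, k_signal=1.0))  # should land near the true SNR of 4
```

With Gaussian noise the second term of (6) vanishes, so $\kappa$ is simply the square root of $G_y / G_r$; for non-Gaussian noise the full quadratic in $\kappa$ would have to be solved instead.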

V.  Blind  Equalization 

Extending (6), the Gauss unlikeness $G_z$ at the output of
the equalizer is given by

$G_z = G_x \cdot [\ldots]$

$|G_z|$ is always less than or equal to $|G_x|$, with equality if and only
if either (x) is Gaussian or (s) = 1. The former case cannot
be solved by any means, and the latter describes a perfectly
equalized channel. Hence it is sufficient to maximize $|G_z|$, e.g.
using a stochastic gradient algorithm (see [2]).
Abstract — Given N strongly mixing observations
$\{X_i, Y_i\}_{i=1}^{N}$, we estimate the regression function
$f^*(x) = E[Y_i \mid X_i = x]$, $x \in \mathbb{R}^d$, from a class of neural networks,
using certain minimum complexity regression estimation
schemes. We establish a rate of convergence for
the integrated mean squared error between the proposed
regression estimator and $f^*$.

I.  Introduction 

Let $\{X_i, Y_i\}_{i=-\infty}^{\infty}$ be a stationary process such that $X_i$
takes values in $\mathbb{R}^d$ and $Y_i$ takes values in $\mathbb{R}$. Given N
observations $\{X_i, Y_i\}_{i=1}^{N}$ drawn from $\{X_i, Y_i\}_{i=-\infty}^{\infty}$, we are
interested in postulating an estimator based on single hidden
layer sigmoidal networks for the regression function
$f^*(x) = E[Y_i \mid X_i = x]$, $x \in \mathbb{R}^d$.

Recently, assuming that the underlying random variables
$\{X_i, Y_i\}_{i=-\infty}^{\infty}$ are i.i.d., Barron [1] proposed a minimum
complexity regression estimator based on single hidden layer
sigmoidal networks. Moreover, supposing that Assumption 1 (see
below) holds, he established a rate of convergence for the
integrated mean squared error between his estimator and $f^*$. In
this paper, we extend Barron's results from i.i.d. random
variables to stationary strongly mixing [3] processes. The reader
is referred to the full paper [2] for complete analysis.

II.  A  Class  of  Target  Regression  Functions 
AND  Single  Hidden  Layer  Sigmoidal  Networks 

Assumption 1. Assume that (a) $Y_i$ takes values in some
interval $I = [a, a + b] \subset \mathbb{R}$ a.s.; (b) $X_i$ takes values in
$B = [-1, 1]^d$ a.s.; and that (c) there exists a complex valued
function $\tilde{f}$ on $\mathbb{R}^d$ such that for $x \in B$, we have

$f^*(x) - f^*(0) = \int_{\mathbb{R}^d} (e^{i w \cdot x} - 1) \, \tilde{f}(w) \, dw$

and that $\int_{\mathbb{R}^d} \|w\|_1 |\tilde{f}(w)| \, dw \le C' < \infty$ for some known
$C' > 0$. Set $C = \max\{1, C'\}$.

Let $\psi : \mathbb{R} \to \mathbb{R}$ denote a sigmoidal function such that
$|\psi(u) - 1_{\{u > 0\}}| \le q'/|u|^p$, $p > 0$, $q' > 0$, for
all $u \in \mathbb{R} \setminus \{0\}$. Set $q = \max\{1, q'\}$. For $n \ge 1$, let
$\gamma_n = n(d + 2) + 1$. For $0 \le i \le n$, let $c_i \in \mathbb{R}$; for $1 \le i \le n$, let
$a_i \in \mathbb{R}^d$ and let $b_i \in \mathbb{R}$. We define a $\gamma_n$-dimensional parameter
vector as

$\theta^{(n)} = (a_1, a_2, \ldots, a_n, b_1, b_2, \ldots, b_n, c_0, c_1, \ldots, c_n).$

Now, define a single hidden layer sigmoidal network
$f_{\theta^{(n)}} : \mathbb{R}^d \to \mathbb{R}$ parametrized by $\theta^{(n)}$ as

$f_{\theta^{(n)}}(x) = c_0 + \sum_{i=1}^{n} c_i \, \psi(a_i \cdot x + b_i), \quad x \in \mathbb{R}^d.$  (1)
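The network in (1) is straightforward to evaluate. The sketch below is illustrative only: the logistic function is used as one possible sigmoid (an assumption; the paper only requires the tail condition stated above, which the logistic function satisfies since it approaches the unit step exponentially fast), and the names `logistic`, `network` and `n_params` are hypothetical.

```python
import math

def logistic(u):
    """Logistic sigmoid, used here as an assumed example of psi."""
    return 1.0 / (1.0 + math.exp(-u))

def network(x, a, b, c, psi=logistic):
    """Evaluate c[0] + sum_i c[i] * psi(a[i-1] . x + b[i-1]), the form of (1).

    a: list of n weight vectors in R^d, b: list of n offsets,
    c: list of n + 1 output weights, with c[0] the constant term.
    """
    assert len(a) == len(b) == len(c) - 1
    hidden = [psi(sum(aj * xj for aj, xj in zip(ai, x)) + bi) for ai, bi in zip(a, b)]
    return c[0] + sum(ci * hi for ci, hi in zip(c[1:], hidden))

def n_params(n, d):
    """Dimension of the parameter vector: n*d weights, n offsets, n + 1 output weights."""
    return n * (d + 2) + 1

print(network([0.0, 0.0], a=[[1.0, 2.0], [3.0, 4.0]], b=[0.0, 0.0], c=[1.0, 2.0, -2.0]))
```

Counting the entries of the parameter vector this way reproduces the dimension $n(d + 2) + 1$ given in the text.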

^This  work  was  supported  by  the  Office  of  Naval  Research  under 
Grant  N00014-90-J-1175. 


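Set $\tau_n$ = [...] and define $\Theta^{(n)} \subset \mathbb{R}^{\gamma_n}$ as

$\Theta^{(n)} = \{\theta^{(n)} : c_0 \in I, \ \sum_{i=1}^{n} |c_i| \le 2C, \ \max_{1 \le i \le n} \|a_i\|_1 \le \tau_n, \ \max_{1 \le i \le n} |b_i| \le \tau_n\}.$

For each fixed n and N and given an $\epsilon_{n,N} > 0$, we construct
an $\epsilon_{n,N}$-net of $\Theta^{(n)}$, namely $T_{n,N}$, such that

$\ln \mathrm{card}(T_{n,N}) \le \gamma_n \ln \frac{[\ldots]}{\epsilon_{n,N}} = L_{n,N},$

where $\mathrm{card}(T_{n,N})$ denotes the cardinality of the set $T_{n,N}$.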


III.  Estimation  Scheme  and  Main  Result 

Let $\alpha(j)$ denote the strong mixing coefficient [3]
corresponding to the process $\{X_i, Y_i\}_{i=-\infty}^{\infty}$.

Assumption 2. Assume that the strong mixing coefficient
satisfies $\alpha(j) = \bar{\alpha} \exp(-c j^{\beta})$, $j \ge 1$, $\bar{\alpha} \in (0, 1]$, $\beta > 0$, $c > 0$.

Write $l_N$ = [...]. $l_N$ plays the same role in
our analysis as the sample size N in the i.i.d. case. Define

[...]

where for a given $\theta \in T_{n,N}$, $f_\theta$ is defined as in (1). Now, for
each fixed regularization constant $\lambda > 0$, define $\hat{n} = \hat{n}_N$ as

$\mathrm{argmin}_{n} \ \{ \ [\ldots] + \lambda \, (L_{n,N} + 2\ln(n + 1)) / [\ldots] \ \}$

and define the minimum complexity estimator as [...].

Theorem 1. Suppose Assumptions 1 and 2 hold. Let
$\lambda > 5b^2/3$ and for some $r > 1/2$ let $(n l_N)^{-r} \le \epsilon_{n,N} \le$ [...];
then

[...]  (2)

where $P_X$ denotes the marginal distribution of $X_i$.

Note that the exponent of N in (2) does not depend on
the dimension d. In [2], we compare the rate of convergence
obtained in Theorem 1 to the rate of convergence achieved by
the classical nonparametric kernel estimator in a similar setting
and to the rate of convergence obtained by Barron [1] in the
i.i.d. setting. In [2], we also establish a result analogous to
Theorem 1 for m-dependent observations.
Abstract — This work describes an EM-based approach
to multi-user detection that treats the signals
of interfering users as hidden data. We consider a new
algorithm based on the space-alternating generalized EM
(SAGE) algorithm appropriate for estimation of discrete
random parameters, and we use it to derive rapidly-convergent
nearly optimum multi-user receivers.

I.  Introduction 

The expectation-maximization (EM) algorithm [1] provides an
iterative approach to parameter estimation when direct
maximization of the likelihood function may be infeasible. The
recently proposed SAGE algorithm [2] modifies EM to update
only a subset of the parameter components at each iteration,
thereby allowing the use of less-informative hidden data in
order to improve convergence rates. We consider a new
algorithm based on the SAGE structure that incorporates the
statistics of the parameter components not currently being
updated. Our motivation is the problem of multi-user detection
[3], for which the vector parameter b corresponds to the
binary data of several users in a CDMA system. The complexity
of optimum decisions for b (under either Bayesian or ML
criteria) is exponential in the number of users [3] and motivates
the development of simpler iterative receivers.

Modifying the SAGE algorithm to treat parameter components
not currently being updated as hidden data leads to a
new "hidden-parameter" EM (HPEM) algorithm. In terms of
the observation Y and vector parameter b with joint density
$f(y, \mathbf{b})$, the ith iteration of the HPEM algorithm is described
by the following steps:

• Set $k = 1 + (i \bmod K)$. Let $\mathbf{b}_{\bar{k}} = \{b_j : j \ne k\}$.
Choose the hidden data x.

• E-step: Compute

$Q(b_k; \mathbf{b}^i) = \int \log f(y, x, \mathbf{b}_{\bar{k}} \mid b_k) \, h(x, \mathbf{b}_{\bar{k}}; y, \mathbf{b}^i) \, dx \, d\mathbf{b}_{\bar{k}}$  (1)

• M-step: $b_k^{i+1} = \arg\max_{b_k} Q(b_k; \mathbf{b}^i)$, $\mathbf{b}_{\bar{k}}^{i+1} = \mathbf{b}_{\bar{k}}^{i}$.

In (1), the integrating density h is given by either the
conditional density $f(x, \mathbf{b}_{\bar{k}} \mid y, b_k = b_k^i)$ or the product of
conditional densities

$f(x \mid y, \mathbf{b}^i) \prod_{m \ne k} f(b_m \mid y, \mathbf{b}_{\bar{m}}^i).$  (2)

For the former case, we have

$Q(b_k; \mathbf{b}^i) = E\{\log f(y, x, \mathbf{b}_{\bar{k}} \mid b_k) \mid y, b_k = b_k^i\},$

which is essentially a smoothed version of the SAGE E-step
objective function. This algorithm produces an estimate
sequence that is non-decreasing in the marginal likelihoods

$\log f(y \mid b_k = b_k^{i+1}) \ge \log f(y \mid b_k = b_k^{i})$  (3)

for all $k = 1, \ldots, K$, where $f(y \mid b_k) = E\{f(y \mid \mathbf{b}) \mid b_k\}$.

This research was supported by the U.S. Army Research Office under
Grant DAAH04-93-G-0219 and by an AT&T Ph.D. Fellowship.


Intuitively, the latter case (2) involves conditioning the
statistics of each hidden parameter $b_m$ on current estimates
for all parameter components except $b_m$, rather than just on
$b_k = b_k^i$. Under mild conditional independence assumptions,
the algorithm with h given by (2) also produces an estimate
sequence satisfying (3).


II.  Receivers  for  “Hidden  Interferers” 

The received signal in the K-user synchronous CDMA
channel is described by $Y \sim N(RA\mathbf{b}, \sigma^2 R)$, where R is
a positive-definite, symmetric matrix of signature waveform
cross-correlations, A is a diagonal matrix of the users' signal
amplitudes, and $\mathbf{b} \in \{\pm 1\}^K$ is the users' data over one
bit interval. The observation Y corresponds to the sampled
outputs of filters matched to the users' signature waveforms.

In applying the HPEM algorithm to multi-user detection,
we assume b is distributed equiprobably on $\{\pm 1\}^K$ and
consider the scenario when R, A, and $\sigma^2$ are known. The
resulting receiver cyclically updates estimates for the K users'
bits. With respect to iterations that update $b_k$, the E-step
computes soft-decision estimates of the interference:

$\hat{b}_m = E\{b_m \mid y, \mathbf{b}_{\bar{m}} = \mathbf{b}_{\bar{m}}^i\} = \tanh\left(\frac{a_m}{\sigma^2}\left(y_m - \sum_{j \ne m} R_{mj} a_j b_j^i\right)\right)$

for all $m \ne k$. The M-step cancels the estimated interference,

$z_k = y_k - \sum_{m \ne k} R_{km} a_m \hat{b}_m,$

and updates the estimate for $b_k$ via $b_k^{i+1} = \mathrm{sgn}(z_k)$. Alternatively,
one could model the parameter $b_k$ as taking values in
the interval $[-c, c]$. In this case, the M-step update is

$b_k^{i+1} = c \, \mathrm{sgn}(z_k/a_k)$ if $z_k/a_k \notin [-c, c]$,
$b_k^{i+1} = z_k/a_k$ if $z_k/a_k \in [-c, c]$.
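The hard-decision version of the receiver described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the initial estimate (signs of the matched-filter outputs), the fixed number of sweeps, and the function name `hpem_detect` are assumptions made for the example, and indices are zero-based.

```python
import math

def hpem_detect(y, R, a, sigma2, n_sweeps=5):
    """Cyclic HPEM-style bit updates for the synchronous CDMA model Y ~ N(R A b, sigma^2 R).

    y: matched-filter outputs, R: cross-correlation matrix (list of rows),
    a: user amplitudes (diagonal of A), sigma2: noise variance.
    """
    K = len(y)
    # Assumed starting point: conventional matched-filter decisions.
    b = [1.0 if v >= 0.0 else -1.0 for v in y]
    for i in range(n_sweeps * K):
        k = i % K  # user whose bit is updated in this iteration
        # E-step: soft estimates of every interfering bit m != k,
        # each conditioned on the current estimates of all other bits.
        soft = [0.0] * K
        for m in range(K):
            if m == k:
                continue
            resid = y[m] - sum(R[m][j] * a[j] * b[j] for j in range(K) if j != m)
            soft[m] = math.tanh(a[m] * resid / sigma2)
        # M-step: cancel the estimated interference and take the sign.
        z = y[k] - sum(R[k][m] * a[m] * soft[m] for m in range(K) if m != k)
        b[k] = 1.0 if z >= 0.0 else -1.0
    return b

# Noise-free near-far example: the matched filter gets user 0 wrong, the sweeps repair it.
R = [[1.0, 0.5], [0.5, 1.0]]
a = [1.0, 3.0]
b_true = [1.0, -1.0]
y = [sum(R[m][j] * a[j] * b_true[j] for j in range(2)) for m in range(2)]
print(hpem_detect(y, R, a, sigma2=0.5))
```

Replacing the sign in the M-step by a clipped linear function of $z_k/a_k$ gives the soft variant with values in $[-c, c]$ mentioned in the text.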


The HPEM receiver has a structure similar to multi-stage
receivers [4], but it enjoys some unique and significant
convergence and performance advantages due to the non-decreasing
likelihood of the estimate sequence $\mathbf{b}^i$. These are verified by
theory and simulation.

One might also consider application of the SAGE algorithm
to multi-user detection. Depending on the M-step update
nonlinearity, the resulting receivers provide either an iterative
implementation of the decorrelating detector [3] or a convergent
multi-stage receiver using sequential, rather than simultaneous,
updates of the bit estimates.
Abstract — The problem of recovering band-limited
signals from noisy data is considered.
Estimates based on Whittaker-Shannon cardinal
expansions, involving sampling windows and
truncation of higher frequencies, are introduced.
Weak and strong pointwise convergence properties
of the proposed estimates are derived.

I.  Introduction 

Consider the problem of recovering a function
f(t), when only the noisy measurements generated
by the following model are available:

$y_k = f(k\tau) + \varepsilon_k, \quad k = 0, \pm 1, \pm 2, \ldots,$

for some $\tau > 0$, where $\varepsilon_k$ represents noise.

The objective of this paper is to examine the
statistical properties of reconstruction schemes for
f(t) which is band-limited, i.e., whose Fourier
transform has support within $(-\Omega, \Omega)$, where $\Omega$ is a finite
number called the bandwidth of f. Such a function
can be represented by the so-called cardinal series
due originally to Whittaker and Shannon

$f(t) = \sum_{k=-\infty}^{\infty} f(k\tau) \, \mathrm{sinc}\left(\frac{\pi}{\tau}(t - k\tau)\right),$  (1)

uniformly in any bounded interval of R, provided
that $\tau \le \pi/\Omega$, where sinc(x) = sin x / x.

II.  Estimation  Techniques 

The representation in (1) forms a natural basis for
our estimation techniques. We construct a class
of recovering algorithms of the following form

$f_\psi(t; \tau, \delta) = \tau \sum_{|j| \le n} y_j K_\delta(t - j\tau),$  (2)

where $K_\delta(t) = \frac{\sin(\Omega t)}{\pi t} \, \psi(\delta \Omega t)$, $0 < \tau \le \pi/\Omega$,
$\delta > 0$, and $\psi$ is a band-limited function with
bandwidth equal to 1 and $\psi(0) = 1$. For $\delta \to 0$ the
estimate in (2) takes the form of the kernel
convolution estimate with the kernel $\sin(\Omega t)/(\pi t)$.
This clearly represents a truncated and smoothed
version of the expansion in (1).

The pointwise properties of the proposed estimates
are established, including consistency and
convergence rate. In particular, it is shown that
for a certain choice of the parameters $\tau$, $\delta$ and the
function $\psi$ the mean squared difference between
$f_\psi(t; \tau, \delta)$ and f(t) can tend to zero as fast as O([...])
uniformly in t and for any f in the band-limited
class.
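The estimate (2) can be sketched directly from its definition. The example below is only an illustration under assumptions that are not taken from the paper: the window $\psi(t) = (\sin(t/2)/(t/2))^2$ is one convenient band-limited function with bandwidth 1 and $\psi(0) = 1$, the data are noise-free, and the names `psi_window`, `kernel` and `estimate` are hypothetical.

```python
import math

def psi_window(t):
    """Assumed window psi(t) = (sin(t/2)/(t/2))^2: bandwidth 1, psi(0) = 1."""
    if abs(t) < 1e-12:
        return 1.0
    s = math.sin(t / 2.0) / (t / 2.0)
    return s * s

def kernel(t, omega, delta, psi=psi_window):
    """K_delta(t) = sin(omega t) / (pi t) * psi(delta omega t), with its limit at t = 0."""
    if abs(t) < 1e-12:
        return omega / math.pi * psi(0.0)
    return math.sin(omega * t) / (math.pi * t) * psi(delta * omega * t)

def estimate(t, sample, tau, omega, delta, n):
    """tau * sum over |j| <= n of y_j K_delta(t - j tau); sample(j) returns y_j."""
    return tau * sum(sample(j) * kernel(t - j * tau, omega, delta)
                     for j in range(-n, n + 1))

# Noise-free check with f(t) = psi_window(t), whose bandwidth 1 lies inside omega * (1 - delta).
tau, omega, delta = 1.0, 2.0, 0.25
approx = estimate(0.3, lambda j: psi_window(j * tau), tau, omega, delta, n=200)
print(approx, psi_window(0.3))
```

The window smooths the edge of the ideal low-pass kernel over the band from $\Omega(1 - \delta)$ to $\Omega(1 + \delta)$, which is why the test signal here is chosen with bandwidth below $\Omega(1 - \delta)$ and the sampling step is kept small enough to avoid aliasing into that transition band.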
Abstract — Consistency and rates of convergence
of the $k_n$-NN estimator are established in the general
case in which samples are chosen arbitrarily from
a compact metric space.

I.  Introduction 

Nearest neighbor (NN) estimation has received much attention
since it was studied in [1]. Most existing work generally
considers the case in which the observations are drawn i.i.d.
from a probability distribution. Some authors (e.g., [2, 4])
have discussed rates of convergence of $k_n$-NN estimators. In
[3], we formulated a new estimation problem in which the
observations need not be drawn randomly but can be arbitrarily
selected. We investigated convergence of the NN rule in this
framework and found its convergence rate. In this paper, we
consider the consistency of the $k_n$-NN estimator under
arbitrary sampling.

II.  Problem  Formulation  and  Preliminaries 
Let $\mathcal{Y} = \mathbb{R}^s$ (for some positive integer s) with inner-product
induced norm $\|\cdot\|$ and let X be a metric space with metric
$\rho$, which we denote $(X, \rho)$. Given a point $x_n \in X$, a random
variable $y_n$ is drawn with unknown conditional probability
distribution $F(y_n \mid x_n)$. We are asked to produce an estimate,
$\hat{y}_n$, of the value of $y_n$ with the goal of minimizing $\|\hat{y}_n - y_n\|^2$.
If $F(y_n \mid x_n)$ is known, the minimum possible expected loss,
attained by the Bayes estimate, is $r^*(x_n)$. If $x_n$ is
chosen arbitrarily, the minimum worst possible expected loss
is the sup over $x_n$ of the conditional Bayes risk, $R^{\max}$.

We consider the problem in which the only knowledge
of $F(y \mid x)$ is that inferred from pairs of samples
$(x_1, y_1), \ldots, (x_{n-1}, y_{n-1})$. The $k_n$-nearest neighbor rule is
defined as follows. Let $k_n$ be any nondecreasing sequence of
numbers. Denote the $k_n$ nearest neighbors of $x_n$ as
$x_n^{(1)}, \ldots, x_n^{(k_n)}$, where $x_n^{(1)}$ is the nearest. We denote $y_n^{(i)}$ as
the parameter associated with the ith nearest neighbor. The $k_n$-NN rule
estimate is the average of the $k_n$ NN parameters,
$\hat{y}_n = \frac{1}{k_n} \sum_{i=1}^{k_n} y_n^{(i)}$.
Let $r_n^{(k_n)}(x_n, x_n^{(1)}, \ldots, x_n^{(k_n)})$ be the expected loss
using the $k_n$ nearest neighbors of $x_n$. If the $x_i$'s are chosen
arbitrarily, in general one cannot get a useful bound on
$r_n^{(k_n)}(x_n, \ldots, x_n^{(k_n)})$. Instead we prove an upper bound on
the cumulative risk. We define $C_n^{(k_n)}(x_1, \ldots, x_n)$ as the
cumulative $k_n$-NN loss and $R_n^{(k_n)}(x_1, \ldots, x_n)$ as the time-averaged
risk of a given arbitrary sequence.
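The $k_n$-NN rule itself is short to write down. The sketch below is illustrative: the Euclidean metric is used as a default (an assumption; any metric on the sample space can be passed in), and the function name `knn_estimate` is hypothetical.

```python
import math

def knn_estimate(x_new, xs, ys, k, metric=math.dist):
    """Average the parameters of the k previously seen points nearest to x_new.

    xs: past sample points (sequences of floats), ys: their parameters
    (sequences of floats of equal length), metric: distance on the sample space.
    """
    k = min(k, len(xs))
    order = sorted(range(len(xs)), key=lambda i: metric(x_new, xs[i]))
    nearest = order[:k]
    dim = len(ys[0])
    return [sum(ys[i][c] for i in nearest) / k for c in range(dim)]

xs = [(0.0,), (1.0,), (2.0,), (10.0,)]
ys = [[0.0], [1.0], [2.0], [10.0]]
print(knn_estimate((0.9,), xs, ys, k=2))
```

Because the rule only needs distances, it applies unchanged to any metric space, which is the setting of the arbitrary-sampling results discussed here.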

We impose the following Lipschitz assumption on $F(y \mid x)$.
Let $m(x) = E[y \mid x]$ and $\sigma^2(x) = E[\|y\|^2 \mid x] - \|m(x)\|^2$. Note
that $r^*(x) = \sigma^2(x)$.

Assumption 1 There exist $K, \alpha > 0$ such that for any
$x_1, x_2 \in X$,

$\|m(x_1) - m(x_2)\| \le \sqrt{K} \, \rho(x_1, x_2)^{\alpha},$

$|\sigma^2(x_1) - \sigma^2(x_2)| \le K \rho(x_1, x_2)^{2\alpha}.$

The metric covering number $\mathcal{N}(\epsilon, A)$ of a compact subset A
of $(X, \rho)$ is defined as the smallest number of open balls of
radius $\epsilon$ that cover A. The inverse function, $\mathcal{N}^{-1}(k, A)$, the
metric covering radius, is the smallest radius such that there
exist k balls of this radius that cover A.

This work was supported in part by the National Science Foundation
under grants IRI-9209577 and IRI-9457645 and by the U.S.
Army Research Office under grant DAAL03-92-G-0320.


III.  Main  Result 

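Theorem 1 With squared error loss, for any $F(y \mid x)$ satisfying
Assumption 1, and with any arbitrary sequence $x_1, \ldots, x_n$
in a compact subset A of $(X, \rho)$, we have that any $k_n$-nearest
neighbor rule satisfies

$C_n^{(k_n)}(x_1, \ldots, x_n) \le \sum_{i=2}^{n} \left(1 + [\ldots]\right) r^*(x_i) + 2K \sum_{i=2}^{n} \left[2 \mathcal{N}^{-1}(\lfloor i/k_n \rfloor, A)\right]^{2\alpha}.$

If $\{k_n\}$ satisfies (C1) $k_n \to \infty$ and (C2) $k_n/n \to 0$ then (when
the limit exists) we have that

$\lim_{n \to \infty} R_n^{(k_n)}(x_1, \ldots, x_n) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=2}^{n} r^*(x_i),$

$\lim \sup R_n^{(k_n)}(x_1, \ldots, x_n) = R^{\max}.$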

The first statement gives an upper bound on the cumulative
risk for an arbitrary sequence in terms of the sum of the
conditional Bayes risks plus the growth rate. The rate is
independent of the sequence but is in terms of an intrinsic topological
quantity of the compact set A. It quantifies how close
arbitrary points in a compact set cluster together. The final
statement states that the asymptotic time-average of the NN
risk equals the time-average of the conditional risks of the
particular sequence. Also, the asymptotic time-average of the
worst possible sequence equals the worst possible conditional
Bayes risk.

In particular, for compact subsets of $\mathbb{R}^r$, it is well-known
that $\mathcal{N}^{-1}(n, A) = O(n^{-1/r})$. This gives a convergence rate
of $O(n^{[\ldots]})$ for the time-averaged risk. This rate coincides
with rates established [2] in the random sampling case.
Abstract — In many situations it is desirable to operate on a
subset of the data only. Such situations arise in the areas of
experimental design, robust estimation of multivariate
location, and density estimation. This paper will describe a
method of subset selection that optimizes the determinant of
the Fisher Information Matrix (FIM), which is called the
Effective Independence Distribution (EID) method. Some
motivation will be provided that justifies the use of the EID,
and the problem of finding the subset of points to use in the
estimation of the Minimum Volume Ellipsoid (MVE) will be
examined as an application of interest.

I.  Problem  Statement 

The determinant of the FIM as an objective function to optimize
arises in many areas of statistics and engineering. In most cases,
the problem is one of subset selection, which can be stated as
follows: given a data set of size n, select a subset of these points of
size m, where m < n, such that the determinant of the FIM is
optimized. It is assumed that each data point has dimension p and
that $n \gg p$. Subset selection should not be confused with
dimensionality reduction, where the goal is to reduce the value p.

Current  methods  of  subset  selection  typically  rely  on  random 
methods  [1].  The  problem  with  these  methods  is  that  they  are  not 
guaranteed  to  find  the  global  optimum  with  respect  to  the  objective 
function  in  any  finite  sampling.  Another  undesirable  aspect  of 
these  is  that  the  results  are  not  reproducible  because  they  are  based 
on  randomly  selected  subsets. 

II.  Effective  Independence  Distribution 

The EID was developed and used by Kammer [2] to choose
optimal sensor locations for a modal identification experiment on
the space station. The EID can be calculated as the diagonal
elements of the following matrix

$\mathrm{EID} = \mathrm{diag}\left(X (X^T X)^{-1} X^T\right)$

where X is an n x p data matrix with each row containing one data
point, and the FIM is given by

$\mathrm{FIM} = X^T X$

It can be shown [3] that the following relationship holds between
the determinants of the FIM as points are removed from a data set

$|X_{-i}^T X_{-i}| = (1 - \mathrm{EID}_i) \, |X^T X|$

where $X_{-i}$ is the data matrix with the i-th point removed and $\mathrm{EID}_i$
corresponds to the i-th data point. From this, it is apparent that the
required optimization of the determinant of the FIM can be
obtained by deleting observations with the appropriate EID value.
Further theory motivating the use of the EID method will be
provided in the poster session.
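The determinant relationship above is easy to verify numerically. The sketch below uses only the Python standard library on a small made-up data matrix; the data values and the helper names `gram`, `solve`, `det` and `eid` are illustrative assumptions, not part of the EID method as published.

```python
def gram(X):
    """Return the p x p matrix X^T X for a list of rows X."""
    p = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]

def solve(M, v):
    """Solve M z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    A = [list(M[i]) + [v[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(n):
            if r != c:
                f = A[r][c] / A[c][c]
                for k in range(c, n + 1):
                    A[r][k] -= f * A[c][k]
    return [A[i][n] / A[i][i] for i in range(n)]

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [list(row) for row in M]
    n, d = len(A), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[piv][c] == 0.0:
            return 0.0
        if piv != c:
            A[c], A[piv] = A[piv], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d

def eid(X):
    """EID_i = x_i (X^T X)^{-1} x_i^T, i.e. the diagonal of X (X^T X)^{-1} X^T."""
    G = gram(X)
    return [sum(a * b for a, b in zip(x, solve(G, x))) for x in X]

# Made-up data: n = 5 points in p = 2 dimensions.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
values = eid(X)
full = det(gram(X))
for i, e in enumerate(values):
    reduced = det(gram(X[:i] + X[i + 1:]))  # FIM determinant without point i
    print(i, round(e, 4), round(reduced, 4), round((1.0 - e) * full, 4))
```

The last two printed columns agree for every point, so ranking points by their EID value ranks the determinants that would remain after deleting each one.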

III.  Application 

The MVE estimator [1] is a robust estimator of location and
covariance structure. Determining the MVE consists of two parts:
1) finding the subset of points to be used in the estimate and 2)
finding the ellipsoid that covers this set. The EID method addresses
the first problem.

To test the usefulness of the EID method, it is applied to six
data sets where the true MVE is known. The paper by Hawkins
[4] gives the correct subset and the resulting volume of the true
MVE for these data sets. These are regression data, and the
predictors are used to determine the MVE. The sizes of the data
sets are relatively small, ranging from n=20 to n=50. The
dimensionality of the data is $2 \le p \le 5$.

The EID algorithm is used to determine the set of points that
comprise the MVE estimate. It is implemented in MATLAB on a
486, 33MHz computer, and the time required to find the subset of
points ranges from 0.11 seconds to 0.77 seconds. The relative
error in the volumes of the minimum covering ellipsoid using the
EID approach is less than 6% for these data sets.

IV. Summary

In this paper, the EID method of subset selection has been
described. Since this is a deterministic method, the results are
repeatable for a given data set, which is a desirable property. This
method is relevant for a wide range of applications.
Abstract

This paper examines the application of Akaike
Information Criterion (AIC) based pruning to the refinement
of nonparametric density estimates obtained via the
Adaptive Mixtures (AM) procedure of Priebe and Marchette.
The paper details a new technique that uses these
two methods in conjunction with one another to predict
the appropriate number of terms in the mixture model of
an unknown density. Results that detail the procedure's
performance when applied to different distributional
classes will be presented. Results will be presented on
artificially generated data, well known data sets, and some
recently collected features for mammographic screening.


I.  APPROACH 

Given $X = (x_1, x_2, \ldots, x_n)$, where each $x_i$ is i.i.d.
according to an unknown density a(x), one is often interested
in estimating a(x). This problem occurs in many areas.
There are a variety of approaches to the multivariate density
estimation problem [1].

An often used parametric approach is that of finite
mixture models [2] in combination with the expectation
maximization (EM) method of Dempster, Laird, and Rubin [3].
Given an unknown distribution a(x) we seek to model the
distribution using a*(x) defined by

$$a^*(x; \psi) = \sum_{i=1}^{m} \pi_i K(x; \Gamma_i) \tag{1}$$

where K is some fixed density parameterized by $\Gamma$, and $\psi =
(\pi_1, \Gamma_1, \pi_2, \Gamma_2, \ldots, \pi_m, \Gamma_m)$. The $\pi_i$'s are referred to as the
mixing proportions. (We can assume for much of what follows
that K is taken to be the univariate normal distribution, in
which case $\Gamma_i$ becomes $\{\mu_i, \sigma_i\}$.) One difficulty with the finite
mixtures approach is that one needs some idea as to the
appropriate number of terms in the mixture model.
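
As a small illustration of the finite mixture model above, the following sketch evaluates an m-term univariate normal mixture at a point. The function names and the example parameter values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def normal_pdf(x, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, terms):
    """Evaluate sum_i pi_i * K(x; mu_i, sigma_i).

    terms is a list of (pi, mu, sigma) triples; the mixing
    proportions pi are required to sum to one.
    """
    total_weight = sum(pi for pi, _, _ in terms)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to 1")
    return sum(pi * normal_pdf(x, mu, sigma) for pi, mu, sigma in terms)


# Hypothetical two-term model (values are illustrative only).
example_terms = [(0.3, -1.0, 0.5), (0.7, 2.0, 1.0)]
density_at_zero = mixture_pdf(0.0, example_terms)
```

The number of terms m is simply `len(terms)` here, which is the quantity the finite mixtures approach needs to be told in advance.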

A recently developed density estimation technique
that circumvents some of the problems of the above technique
is the adaptive mixtures density estimation (AMDE) procedure
of Priebe [4]. This procedure is a blend of the finite
mixtures and kernel estimator approaches. It is essentially a
mixture-type approach that allows for the creation of new
terms as indicated by the data complexity. It is important to
note that, unlike finite mixture models, the number of terms m
is not fixed but is estimated from the data. A problem with the
adaptive mixtures procedure is that the solutions it produces
are typically overdetermined: while they are good functional
estimates of a(x), they have too many terms in the
mixture.

Using the Akaike Information Criterion (AIC) [5] as
a starting point, we have developed a procedure that takes a
single adaptive mixtures density estimate, or a set of them, and
produces a set of pruned models with lower complexity. This
procedure attempts to obtain a minimum complexity model
by iteratively pruning terms from the original model. The
key to this approach is AIC-based pruning of AMDE models
using resampled data sets.
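
The pruning procedure itself is not spelled out in this abstract, so the following is only a sketch of the AIC score that such pruning could compare, not the authors' method. It assumes univariate normal terms and counts 3m - 1 free parameters for an m-term mixture (m means, m standard deviations, m - 1 free mixing proportions). The data and models are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(data, terms):
    """Log-likelihood of data under a mixture given as (pi, mu, sigma) triples."""
    return sum(
        math.log(sum(pi * normal_pdf(x, mu, sigma) for pi, mu, sigma in terms))
        for x in data
    )


def aic(data, terms):
    """AIC = 2k - 2 ln L, with k = 3m - 1 free parameters (assumed count)."""
    k = 3 * len(terms) - 1
    return 2 * k - 2.0 * log_likelihood(data, terms)


# Hypothetical data and models: the 3-term model duplicates one component,
# so it fits exactly as well as the 2-term model but is penalized more.
data = [-1.2, -0.9, -1.1, 1.8, 2.1, 2.4, 1.9]
two_terms = [(0.4, -1.0, 0.3), (0.6, 2.0, 0.4)]
three_terms = [(0.4, -1.0, 0.3), (0.3, 2.0, 0.4), (0.3, 2.0, 0.4)]
prefer_pruned = aic(data, two_terms) < aic(data, three_terms)
```

In this toy case the redundant term costs exactly three extra parameters, so the AIC of the overdetermined model is higher by 6 and the pruned model is preferred.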

II. RESULTS

Results  obtained  using  this  procedure  were  pre¬ 
sented  at  the  poster  session. 
I. Introduction

Ziv-Zakai bounds [1]-[4] on the mean-square-error (MSE)
in  parameter  estimation  are  some  of  the  tightest  available 
bounds.  These  bounds  relate  the  MSE  in  the  estimation  prob¬ 
lem  to  the  probability  of  error  in  a  binary  hypothesis  test¬ 
ing  problem.  The  original  Bayesian  version  derived  by  Ziv 
and  Zakai  [1],  and  improvements  by  Chazan-Zakai-Ziv  [2]  and 
Bellini-Tartara  [3]  are  applicable  to  scalar  random  variables 
with  uniform  prior  distributions.  This  bound  was  recently  ex¬ 
tended  by  Bell-Ephraim-Steinberg-Van  Trees  [4]  to  vectors  of 
random  variables  with  arbitrary  prior  distributions.  The  goal 
of  this  paper  is  to  present  an  improvement  to  the  vector  ver¬ 
sion  of  [4],  explore  some  properties  of  the  bounds,  and  present 
further  generalizations. 


II.  Improved  Vector  Bound 

Assume that a vector of random variables $\theta$ with prior probability
density function $p(\theta)$ is estimated from the observation
vector $\mathbf{x}$. The estimation error is defined as $\epsilon = \hat{\theta}(\mathbf{x}) - \theta$, and we
can lower bound $a^T E\{\epsilon\epsilon^T\} a$, where $a$ is an arbitrary vector,
by either of the two bounds:

$$\text{(i)} \quad a^T E\{\epsilon\epsilon^T\} a \ge \int_0^\infty \frac{\Delta}{2}\, V\left\{ \int \big(p(\varphi) + p(\varphi + \delta)\big)\, P_{\min}(\varphi, \varphi + \delta)\, d\varphi \right\} d\Delta \tag{1}$$

$$\text{(ii)} \quad a^T E\{\epsilon\epsilon^T\} a \ge \int_0^\infty \frac{\Delta}{2}\, V\left\{ \int 2 \min\big(p(\varphi), p(\varphi + \delta)\big)\, P^{el}_{\min}(\varphi, \varphi + \delta)\, d\varphi \right\} d\Delta \tag{2}$$

where the vector $\delta$ satisfies

$$a^T \delta = \Delta, \tag{3}$$

$P_{\min}(\varphi, \varphi + \delta)$ is the minimum probability of error in the binary
detection problem:

$$H_0: \theta = \varphi; \quad \Pr(H_0) = \frac{p(\varphi)}{p(\varphi) + p(\varphi + \delta)}$$

$$H_1: \theta = \varphi + \delta; \quad \Pr(H_1) = 1 - \Pr(H_0), \tag{4}$$

$P^{el}_{\min}(\varphi, \varphi + \delta)$ is the minimum probability of error in the same
binary detection problem but with equally likely hypotheses,
and $V\{\cdot\}$ is a valley-filling function. Since probability of error
results are easier to derive and more plentiful for the equally
likely problem, the second bound may be more useful
computationally.

In applying the bounds, one has to choose $a$ and $\delta$. The
choice for $a$ is dictated by the particular parameter or linear
combination of parameters being investigated. If a bound on
the MSE of the $i$-th parameter is desired, then $a$ must be the
unit vector with a one in the $i$-th position.

The vector $\delta$ determines the position of the second hypothesis
$\theta = \varphi + \delta$. It is constrained to lie in the hyperplane defined
by (3). Generally, $\delta$ should be chosen so that the two hypotheses
are as indistinguishable as possible by the optimum
detector. In [4], $\delta$ was chosen to be

$$\delta = \frac{\Delta}{\|a\|^2}\, a. \tag{5}$$

This choice results in the hypotheses being separated by the
smallest Euclidean distance. When $a$ is chosen to produce a
bound on the MSE of the $i$-th parameter, the resulting bound is
equivalent to that obtained by conditioning the scalar bound
[4] on the remaining parameters, and taking the expected
value with respect to those parameters. Choosing $\delta$ according
to (5) does not always lead to the tightest bound because
hypotheses separated by the smallest Euclidean distance are not
necessarily the most indistinguishable. A higher $P_{\min}(\varphi, \varphi + \delta)$
can be achieved by choosing

$$\delta = \frac{\Delta}{\|a\|^2}\, a + b \tag{6}$$

where $b$ is not a function of $\varphi$, and is orthogonal to $a$, thus
(3) is satisfied. With this choice, the bound on the MSE of
the $i$-th parameter cannot be reduced to the expected value of
the conditional scalar bound.


III. Further Extensions

Other results concerning the Ziv-Zakai bound are as follows.
First, a tighter bound which uses the probability of error in
an M-ary hypothesis testing problem is derived.

Second, both the binary and M-ary bounds are equal to the
minimum MSE when the posterior density of $a^T\theta$ given the
observations, $p(a^T\theta \mid \mathbf{x})$, is symmetric and unimodal.
Furthermore, in the limit of no data, the bound converges to the true
a priori variance when the prior density of $a^T\theta$ is symmetric
and unimodal.

Third, for problems in which some of the parameters may
be considered random variables with prior probability density
functions, but some are considered unknown, deterministic
quantities, the bound can be extended to a hybrid version
combining the derivation leading to (1) and (2) with a derivation
similar to that in [1] for non-random parameters.

Fourth, the bound can be extended to any non-decreasing
cost function of $|a^T\epsilon|$ in a straightforward manner.
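
The no-data limit mentioned among the further extensions can be checked numerically in the scalar case. The sketch below makes several assumptions that are not taken verbatim from the paper: that bound (1) has the form $\int_0^\infty (\Delta/2)\, V\{\int (p(\varphi) + p(\varphi+\Delta)) P_{\min}\, d\varphi\}\, d\Delta$, that the prior is zero-mean Gaussian with standard deviation `sigma`, and that there are no observations, so the best detector simply picks the more probable hypothesis and $(p(\varphi) + p(\varphi+\Delta)) P_{\min} = \min(p(\varphi), p(\varphi+\Delta))$. For this prior the inner integral equals $2Q(\Delta/(2\sigma))$, which is already non-increasing, so valley-filling leaves it unchanged. Function names are illustrative.

```python
import math


def q_function(u):
    """Gaussian tail probability Q(u) = P(N(0, 1) > u)."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))


def no_data_bound(sigma, upper_sigmas=12.0, steps=20000):
    """Trapezoid-rule value of int_0^inf (D/2) * 2 Q(D / (2 sigma)) dD."""
    upper = upper_sigmas * sigma
    h = upper / steps

    def integrand(delta):
        overlap = 2.0 * q_function(delta / (2.0 * sigma))  # int min(p, shifted p)
        return 0.5 * delta * overlap

    total = 0.5 * (integrand(0.0) + integrand(upper))
    total += sum(integrand(i * h) for i in range(1, steps))
    return total * h


bound = no_data_bound(2.0)  # should be close to the prior variance 4.0
```

Under these assumptions the integral evaluates to $\sigma^2$, consistent with the statement that the bound converges to the a priori variance for a symmetric, unimodal prior.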
Abstract - We present a memory-binding density
transformation as a means of improving performance
of entropy coders acting on memoried sources.


I.  Introduction 

Reversible coders are often called upon to operate on sources
with memory. Though Shannon's work suggests that coding
performance may be enhanced by encapsulating memory
information in an M-dimensional pdf, in many situations this
approach is impractical. Thus, coders are often forced to view
the source as memoryless and attempt to encode near the
entropy of a high-entropy single-symbol pdf, rather than the
desired lower-entropy multi-symbol pdf. We propose
a reversible memory-binding transform (MBT) alternative
which improves performance by binding memory information
from the multi-symbol pdf into the single-symbol pdf to
be processed by the coder.

II.  Density  Transformation  Algorithm 

Assume a source sequence $\{x_i\}$, $x_i \in A$, where $A =
\{\alpha_1, \ldots, \alpha_N\}$ is an alphabet of N symbols. Memory
information associated with $x_i$ is represented by the vector
$X_i = (x_{i-1}, \ldots, x_{i-M})$. For each $X_i$ there exists a mapping $\phi_i$,
a permutation of A that produces $B_i = \phi_i A = \{\beta_1, \ldots, \beta_N\}$.
This permutation is a function of $X_i$ that encapsulates memory
of the M symbols previous to $x_i$ and may be represented
by either a rule or list. In either case, $\phi_i$ has the property that
the a priori probability, $P(x_i = \beta_1 \mid X_i)$, is maximum and
$P(x_i = \beta_n \mid X_i) \ge P(x_i = \beta_{n+1} \mid X_i)$ for $n = 1, \ldots, N-1$.

The density transform is defined as $f : x_i \to y_i$, where
$x_i, y_i \in A$, so that $y_i = f(x_i) = \sum_{n=1}^{N} \alpha_n \delta(x_i - \beta_n)$. The inverse
transform is given by $x_i = f^{-1}(y_i) = \sum_{n=1}^{N} \beta_n \delta(y_i - \alpha_n)$.

The probability density functions associated with $x_i$ and $y_i$ are
given by p(x) and p(y) respectively.
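
The transform described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the permutation for each context is stored as a list built from conditional symbol counts of the sequence itself, that ties are broken by alphabet order, and that the decoder has access to the same tables. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_tables(seq, alphabet, M=1):
    """For each length-M context, list the alphabet in order of
    decreasing conditional frequency (ties broken by alphabet order)."""
    counts = defaultdict(Counter)
    for i in range(M, len(seq)):
        counts[tuple(seq[i - M:i])][seq[i]] += 1
    return {
        ctx: sorted(alphabet, key=lambda s: (-c[s], alphabet.index(s)))
        for ctx, c in counts.items()
    }


def mbt_forward(seq, alphabet, tables, M=1):
    """y_i = alpha_n where x_i = beta_n in the permutation for its context."""
    out = list(seq[:M])  # the first M symbols have no full context
    for i in range(M, len(seq)):
        perm = tables.get(tuple(seq[i - M:i]), alphabet)
        out.append(alphabet[perm.index(seq[i])])
    return out


def mbt_inverse(out, alphabet, tables, M=1):
    """x_i = beta_n where y_i = alpha_n, using already decoded context."""
    seq = list(out[:M])
    for i in range(M, len(out)):
        perm = tables.get(tuple(seq[i - M:i]), alphabet)
        seq.append(perm[alphabet.index(out[i])])
    return seq


def entropy_bits(seq):
    """First-order (single-symbol) entropy in bits per symbol."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())


# Illustrative memoried source: each value repeats four times, then steps.
alphabet = list(range(8))
x = [(i // 4) % 8 for i in range(240)]
tables = build_tables(x, alphabet, M=1)
y = mbt_forward(x, alphabet, tables, M=1)
```

For this toy source the single-symbol entropy of `x` is close to 3 bits, while `y` concentrates on the first two alphabet symbols, so a memoryless entropy coder applied to `y` has much less to encode. The alphabet size is unchanged.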

III.  Transformation  Characteristics 
Viewed in one sense, an MBT is a generalization of both
differential and modulo-PCM coding. In another sense, it is
conceptually an alternative projection mechanism for producing
a low-entropy low-dimensionality pdf from a low-entropy
high-dimensionality pdf with memory. In a third sense, from
an encoder perspective, it may also be viewed as a transformation
from p(x) to p(y) which binds memory information to
the symbols forming the domain y. From any perspective, an
MBT, appropriately inserted between a source and coder, is a
mechanism for increasing coder performance by reducing the
entropy of the source pdf. For many sources of interest, such
as imagery, E{p(y)} < E{p(x)}, where E{·} is the entropy
function.

The fact that the MBT effectively reduces the entropy of
a memoried source separately from the choice of encoder is
significant from a theoretical perspective because it separates
the coding process into distinct entropy reduction and coding
stages rather than combining both operations into one step.
Thus, the encoding process as a whole becomes more general
and source-independent. The MBT also has the characteristic
of transforming source densities of interest, which may be
compound multi-modal distributions, into simple structured
densities with a predictable parametric shape similar to the
gamma distribution. This is useful from both an information
theoretic and statistical perspective because it provides
an effective interface between real world data sources and
information theoretic models based on parametric distributions.
In this case, coding models based on statistically determined
gamma-type distributions may be directly exported to a
variety of real-world sources via an MBT. The transform
introduced here is similar to the modulo coding schemes of [1] in that
it does not increase the size of the source alphabet supplied to the
encoder. This contrasts with differential coding, which can
potentially increase the alphabet size by a factor of two.

IV.  Experimental  Results 
The MBT is applicable to both traditional and non-traditional
sources. Figure 1 demonstrates the application of
the MBT to the ubiquitous Lena image for M = 1. Tests with
MBT-AH (Adaptive Huffman) and MBT-LZW (Lempel-Ziv-Welch)
coder pairs showed significant performance gains over
those by either AH or LZW alone. We have also applied an
MBT-AH coding pair to the indices output by a vector quantizer
(VQ) in conjunction with memory knowledge provided by the
codebook. We have found that MBT-AH boosted VQ compression
performance by a factor of nearly 1.5 in comparison
to the factor of 1.1 for AH alone.


Fig. 1: Reversible Transformation of Lena Image pdf: (a) p(x),
entropy = 7.44 bpp (b) p(y), entropy = 5.06 bpp


V.  Conclusions 

We have presented a reversible memory-binding transform
algorithm. The transform, inserted as a separate stage between
the source and coder, serves to increase performance
separately from the choice of encoder. This transform shows much
promise for use in the fields of information theory, coding,
and statistics.
Abstract - Projection pursuit autoregression and projection
pursuit moving average, both with multivariate polynomials as
ridge functions, are proposed in this paper (abbreviated
MPPAR and MPPMA, respectively). The $L_2$-convergence of
these methods is proved. This paper also proposes two new
algorithms for MPPAR and MPPMA. Using these methods, we
establish mathematical models for the Wolfer sunspot data and
the Canadian lynx data.


I. Introduction

The  results  presented  here  were  motivated  by  the  research 
concerning  both  non-linear  non-parametric  time  series  anal¬ 
ysis  and  projection  pursuit  regression. 

II. Projection Pursuit Autoregression
and Its $L_2$-Convergence

The process $x_t$, $t = 0, \pm 1, \pm 2, \ldots$ is said to be a non-linear
autoregression of order k (NLAR(k)) process if $x_t$ is stationary
and there exists a function f such that

$$x_t = f(x_{t-1}, \ldots, x_{t-k}) + z_t, \quad t \in \mathbb{Z},$$

where $z_t \sim \mathrm{WN}(0, \sigma^2)$.

The projection pursuit autoregression model can be
expressed as

$$x_t = \sum_{j=1}^{m} g_j(a_j^T x) + z_t, \quad t \in \mathbb{Z},$$

where $a_j^T x$ denotes a one-dimensional projection of the vector
x, $a_j$ is the projective direction, $a_j^T = (a_{j1}, \ldots, a_{jk})$, $g_j$ is
called the ridge function, and $z_t \sim \mathrm{WN}(0, \sigma^2)$.

For the $L_2$-convergence of MPPAR we have the following
theorem.

Theorem. Let $x_t$ be a non-linear autoregressive time series
NLAR(k), and $\int f^2(x)\, dP < \infty$; then there exist $a_j$ and
$g_j(u)$ such that, as $m \to \infty$,

$$\int \Big[f(x) - \sum_{j=1}^{m} g_j(a_j^T x)\Big]^2 I_S(x)\, dP \to 0,$$

where $g_j(u)$ is a polynomial and $S = \{(x_1, \ldots, x_k) : -c_1 \le x_1
\le c_1, \ldots, -c_k \le x_k \le c_k\}$, where $c_j$, $j = 1, \ldots, k$, are large
enough positive numbers.

III. Projection Pursuit Moving Average
and Its $L_2$-Convergence

Let $x_t$, $t \in \mathbb{Z}$, be a non-linear moving average of order l
(NLMA(l)) process,

$$x_t = h(z_{t-1}, \ldots, z_{t-l}) + z_t, \quad t \in \mathbb{Z},$$

where $z_t \sim \mathrm{WN}(0, \sigma^2)$ and h is a function. It is obvious that
MPPMA has $L_2$-convergence, as the following corollary
shows:

Corollary. Let $x_t$ be a non-linear moving average time series
NLMA(l), and $\int h^2(z)\, dP < \infty$; then there exist $a_j$ and
$g_j(u)$ such that, as $m \to \infty$,

$$\int \Big[h(z) - \sum_{j=1}^{m} g_j(a_j^T z)\Big]^2 I_S(z)\, dP \to 0,$$

where $g_j(u)$ is a polynomial and $S = \{(z_1, \ldots, z_l) : -c_1 \le
z_1 \le c_1, \ldots, -c_l \le z_l \le c_l\}$, where $c_j$, $j = 1, \ldots, l$, are large
enough positive numbers, and $z^T = (z_{t-1}, \ldots, z_{t-l})$.

IV. The Algorithms of MPPAR and MPPMA

We propose the following MPPAR algorithm:

Step 1. First analysing the data and drawing the scatterplot,
we use the Durbin-Levinson method and criterion
to identify the order k of the fitted model and estimate the
variance of white noise.

Step 2. We select a suitable m and the powers of the
polynomials $g_j(\cdot)$, $j = 1, \ldots, m$, then minimize the objective
function

$$v(a, g) = \sum_{i} \Big[x_i - \sum_{j=1}^{m} g_j(a_j^T \mathbf{x}_i)\Big]^2,$$

where $\mathbf{x}_i^T = (x_{i-1}, \ldots, x_{i-k})$ and the $x_i$ are observations; $x_i$ is
$\mathbb{R}$-valued and $\mathbf{x}_i$ is $\mathbb{R}^k$-valued. We then establish
the preliminary model $x_t = \sum_{j=1}^{m} g_j(a_j^T \mathbf{x}) + z_t$.

Step 3. Examining the residuals, we find out whether the
residuals have the appearance of a realization of white
noise.

Step 4. Repeat steps 2 and 3 until the residuals have the
appearance of a realization of white noise.

The algorithm of MPPMA is similar to the algorithm of MPPAR,
except for step 1.
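
Part of step 2 can be illustrated with a short sketch. It covers only the inner least-squares problem: for projection directions that are held fixed, the polynomial ridge functions are linear in their coefficients, so they can be fitted with ordinary least squares. The search over directions, the order selection of step 1, and the residual checks of steps 3 and 4 are not shown. The function names and the single shared intercept are assumptions made for this illustration, not details from the paper.

```python
import numpy as np


def lagged(x, k):
    """Return rows (x[t-1], ..., x[t-k]) and targets x[t] for t = k, ..., n-1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    rows = np.column_stack([x[k - j:n - j] for j in range(1, k + 1)])
    return rows, x[k:]


def fit_ridge_polynomials(x, directions, degree):
    """Least-squares fit of polynomial ridge functions for fixed directions.

    directions has shape (m, k); returns (coefficients, sum of squared
    residuals). Coefficients are ordered as: intercept, then the powers
    1..degree of the first projection, then those of the second, and so on.
    """
    directions = np.atleast_2d(np.asarray(directions, dtype=float))
    rows, target = lagged(x, directions.shape[1])
    proj = rows @ directions.T  # one column per direction a_j
    columns = [np.ones_like(target)]
    for j in range(proj.shape[1]):
        for power in range(1, degree + 1):
            columns.append(proj[:, j] ** power)
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return coef, float(resid @ resid)
```

A full implementation would wrap this in an outer optimization over the directions $a_j$, then inspect the residuals for whiteness as in steps 3 and 4.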


V. The Applications


By using the new methods, we establish the models for the
Wolfer sunspot data and Canadian lynx data, respectively.
The mathematical model of the Wolfer sunspot data (1700-

X,  = —  0.  01431  -|-  0.  29984  a^jic-l-O.  00555aT:v _ o.  00602 

(arx-)*+  2.  , where  x^  =  (z,-,,!,-*),  aj  =  (Q.  88869  — 
0.  45851) ,  a?  =  (-  0.  65241,0.  75786).  z,  WN(0, 0. 

The results show that the residuals of the model are less than
5%.

The mathematical model for the Canadian lynx data (1821-
1934).

*«  = —  0.  02586  —  0.  0415  oiix — 0.  57932a2’x _ q.  02586 

(o!*)*,  where  x^  =  (*,_i,z,_2)  ,  aT  =  (0.  87928  —0 
48901),  a?-=(0.  09684,-0.  99530).  z.^WN(0,o! 


The results show that the residuals of the model are less than
2.5%.
Abstract - We consider a truncated version of
the entropy estimator proposed in [1] and prove
the mean square $\sqrt{n}$-consistency of this estimator
for a class of densities with unbounded support,
including the Gaussian density.

Summary 

Let $X_1, \ldots, X_n$ be a sample of i.i.d. random variables
with common density $f(x)$, $x \in \mathbb{R}$. We consider the
problem of estimating the unknown entropy

$$H(f) = -\int f(x) \ln f(x)\, dx.$$

This problem has various applications in hypothesis
testing and information theory. There exist two main
approaches to the construction of entropy estimators.
The first approach consists of substitution of $f(x)$ in
$H(f)$ by a suitable nonparametric density estimator.
The second approach is based on spacings. Let $X_{n,1} \le
X_{n,2} \le \cdots \le X_{n,n}$ be the order statistics of $X_1, \ldots, X_n$.

The estimator

$$H_{m,n} = \frac{1}{n} \sum_{i=1}^{n-m} \ln\left(\frac{n}{m}\,(X_{n,i+m} - X_{n,i})\right),$$

where m is a positive integer less than n, was introduced
in [2]. Its asymptotic properties as $n \to \infty$ have been
studied by several authors under various assumptions
on f. Here we study an entropy estimator which is
somewhat different from $H_{m,n}$ and is defined by

$$H_n = \frac{1}{n} \sum_{i=1}^{n} \ln\{2 \rho_i \gamma (n - 1)\},$$

where $\rho_i = \min\{a_n, \min_{j \ne i} |X_i - X_j|\}$, $a_n \to 0$ is a
sequence of positive numbers, $\gamma = \exp\{C_E\}$, and $C_E$
is Euler's constant. $H_n$ is a truncated and modified
version of the estimate introduced in [1]. In Theorem
1 we prove that the bias of $H_n$ is of order $O(1/\sqrt{n})$, $n \to
\infty$. In Theorem 2 we show that the variance of $H_n$ is
of order $O(1/n)$. Our results hold for densities f with
unbounded support and exponentially decreasing tails,
such as the Gaussian density.
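
A minimal sketch of the truncated nearest-neighbour estimator follows, assuming the form $H_n = \frac{1}{n}\sum_i \ln\{2\rho_i\gamma(n-1)\}$ with $\rho_i = \min\{a_n, \min_{j \ne i}|X_i - X_j|\}$. The abstract only requires $a_n \to 0$; the default $a_n = 1/\ln n$ used here is an arbitrary choice for illustration, as is the function name. The sample is assumed to have no repeated values.

```python
import numpy as np


def truncated_nn_entropy(sample, a_n=None):
    """Truncated nearest-neighbour entropy estimate for a 1-D sample (nats)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    gaps = np.diff(x)
    # In sorted order the nearest neighbour is the closer adjacent point.
    left = np.concatenate(([np.inf], gaps))
    right = np.concatenate((gaps, [np.inf]))
    rho = np.minimum(left, right)
    if a_n is None:
        a_n = 1.0 / np.log(n)  # illustrative truncation level only
    rho = np.minimum(rho, a_n)
    gamma = np.exp(np.euler_gamma)  # exp of Euler's constant
    return float(np.mean(np.log(2.0 * rho * gamma * (n - 1))))


rng = np.random.default_rng(0)
estimate = truncated_nn_entropy(rng.standard_normal(2000))
true_entropy = 0.5 * np.log(2.0 * np.pi * np.e)  # entropy of N(0, 1) in nats
```

For a standard Gaussian sample the estimate can be compared with the exact entropy $\frac{1}{2}\ln(2\pi e)$.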

Consider the following assumptions:

(A0) $\int f(x) |\ln f(x)|\, dx < \infty$.

(A1) f is twice continuously differentiable and
strictly positive on $\mathbb{R}$.

{M)  //(x)exp(-6/(x))dx  <  (76“^ 

where  C  is  a  finite  positive  constant. 

Theorem 1. Assume (A0)-(A2). Then, as $n \to \infty$,

$$E(H_n) - H(f) = O\left(\frac{1}{\sqrt{n}}\right).$$

Next consider the conditions:

(B1) f is Lipschitz continuous and strictly positive
on $\mathbb{R}$.

(B2)  There  exists  a  >  0  such  that  for  j  =  1,2,3 

j  /(®)( — <  °° 

J  °  f{x)dx  <  00. 

Theorem 2. Assume (A0), (B1), (B2). Then, as $n \to \infty$,

$$E\{H_n - E(H_n)\}^2 = O\left(\frac{1}{n}\right).$$

Corollary. Assume (A0)-(A2), (B2). Then $H_n$ is $\sqrt{n}$-consistent
in the mean square, i.e., as $n \to \infty$,

$$E\{\sqrt{n}\,(H_n - H(f))\}^2 = O(1).$$
Abstract - Herein we apply methods of Universal
Classification to the problem of classifying one of
M deterministic signals in the presence of dependent
non-Gaussian noise.

Definition of Problem: The M-ary Signal Classification
problem considered is defined as follows:

$$H_i: X^n \sim \text{Noise} + S(\theta_i), \quad i = 1, 2, \ldots, M$$

where the test vector $X^n$ is of length n. The (M+1)-th
hypothesis corresponds to the case in which $X^n$ arises
from none of the above M hypotheses. In addition, we
shall assume that the noise arises from an unknown K-th
order Markov source and that hypothesis $H_1$ corresponds
to "no signal present," that is, $S(\theta_1) = 0$ for all n.

In  this  work  we  develop  a  classification  scheme  which 
is  independent  of  the  true  statistical  model  of  the  environ¬ 
ment  and  still  achieves  many  of  the  desirable  properties 
of  the  globally  optimal  detector. 

Proposed Classifier: In the absence of a statistical
model for the noise, we will assume the existence of a
length N training vector $t^N$ from the noise source. We
propose the following classifier based on the work of Ziv
and Gutman in [1,2]:

N 

-A  ,i  =  1,2,  ...,M 
where $(\cdot)_q$ denotes the quantization of the continuous alphabet
source, $d_{KL}(P_x, P_y) \stackrel{\text{def}}{=} E \log\{P_x / P_y\}$ is
the Kullback-Leibler distance between the types $P_x$
and $P_y$ of the quantized data, and $\lambda$ is a positive constant
chosen to satisfy some design criterion. The decision
regions $\{\Lambda_1, \Lambda_2, \ldots, \Lambda_M\}$ corresponding to the M
hypotheses $H_1, H_2, \ldots, H_M$ are defined as follows: The region
$\Lambda_1$ is defined as the set of all sequences $(X^n, t^N)$
for which $h(X^n, t^N, \theta_i, \lambda) > 0$ for all $i = 2, 3, \ldots, M$. The
region $\Lambda_j$ for $j = 2, 3, \ldots, M$ is defined as the set of all
sequences $(X^n, t^N)$ for which $h(X^n, t^N, \theta_i, \lambda) > 0$ for all
$i = 1, 2, \ldots, M$, $i \ne j$, and $h(X^n, t^N, \theta_j, \lambda) < 0$, and the
rejection region is $\Lambda_R = \left(\bigcup_{i=1}^{M} \Lambda_i\right)^c$.
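
The Kullback-Leibler distance between types that enters the classifier can be sketched as follows. The code treats types as zeroth-order empirical distributions of uniformly quantized samples; the paper's classifier works with a Markov noise model and its exact statistic is not reproduced here, so this is only an illustration of the building block. The quantizer step and all function names are assumptions.

```python
import math
from collections import Counter


def quantize(samples, step):
    """Uniform scalar quantizer: map each sample to an integer cell index."""
    return [math.floor(s / step) for s in samples]


def empirical_type(symbols):
    """Empirical distribution (type) of a symbol sequence."""
    n = len(symbols)
    return {s: c / n for s, c in Counter(symbols).items()}


def kl_between_types(x_symbols, y_symbols):
    """D(P_x || P_y) in nats; infinite if P_x puts mass where P_y has none."""
    p_x = empirical_type(x_symbols)
    p_y = empirical_type(y_symbols)
    total = 0.0
    for symbol, p in p_x.items():
        q = p_y.get(symbol, 0.0)
        if q == 0.0:
            return math.inf
        total += p * math.log(p / q)
    return total


# Illustrative comparison of a test sequence against a training sequence.
distance = kl_between_types([0, 0, 1, 1], [0, 1, 1, 1])
```

A decision rule of the kind described above would compare such a distance, computed after removing a candidate signal from the test vector, against the threshold $\lambda$.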

Summary of Theoretical Results:

1) The asymptotic probability of error under each hypothesis
decays exponentially fast at a rate $\lambda$ as the
length of the test vector $X^n$ grows without bound,

$$\lim_{n \to \infty} \frac{1}{n} \log P(e \mid H_i) \le -\lambda, \quad i = 1, \ldots, M$$

regardless of the length of the training vector.

2)  The  asymptotic  probability  of  detection  under  each 
hypothesis  tends  to  1  as  the  length  of  the  test  vector 
grows  without  bound  provided  that  lim„_^oo  v  ^ 
0  and  0  <  A  <  Ao, 

$$\lim_{n \to \infty} P(\Lambda_i \mid H_i) = 1, \quad i = 1, \ldots, M$$

for an appropriately chosen constant $\lambda_0$.

3) The probability of rejection under each hypothesis
for iid noise sources falls off exponentially fast as
the length of the test vector grows without bound
subject to the above constraints on n, N and $\lambda$,

$$\lim_{n \to \infty} \frac{1}{n} \log P(\Lambda_R \mid H_j) \le -\lambda_j, \quad j = 1, 2, \ldots, M$$

where $\lambda_j > 0$, $j = 1, 2, \ldots, M$.
Abstract - A matrix form of the Brunn-Minkowski
Inequality is derived, which may be applied in calculating
the uncoded bit rate of lattice quantization and
modulation schemes.

I.  Introduction 

In many practical coding schemes, the bit allocation to the
symbols, at least in the intermediate phases of the coding
process, is done according to geometric considerations rather
than probabilistic ones. For instance, the number of effective
code words of a lattice quantizer is determined by the
shape and the volume of its granular region, and by the shape
and the volume of its Voronoi cell. Similarly, the size of a
lattice type constellation of a coded modulation system is
determined by the shape and the volume of the symbols' decision
cells, and the shape and the volume of the space of allowable
signals (the latter is determined, e.g., by peak power, peak
spectrum or peak amplitude constraints).

Hence, the bit rate at the quantizer output, or in the modulator
input, is generally higher than the overall coding rate
of the system, and it may be estimated by the geometric rate

where d is the dimension of the space (of source or channel
signals), $A_X$ is the region of input signals, $A_N$ is the basic
cell (of the quantizer or the constellation), $\mu(A) = \int_A dx$
is the (d-dimensional) volume of the region A, and $A_X +
A_N = \{x + y : x \in A_X, y \in A_N\}$ is the Minkowski sum of
$A_X$ and $A_N$. This sum may be interpreted as the geometric
convolution of the two regions.

As in the problem of estimating the information rate in
an additive noise channel, the geometric rate $R_G$ is also not
calculated easily, but it may be estimated by means of lower
and upper bounds. For example, the volume of the Minkowski
sum in (1) may be lower bounded via the Brunn-Minkowski
Inequality (BMI) [1],

$$\mu(A_X + A_N)^{1/d} \ge \mu(A_X)^{1/d} + \mu(A_N)^{1/d}. \tag{2}$$

Equality in (2) holds if the two regions are convex and
proportional, e.g., if they are balls or cubes (with parallel edges).
For d = 1, this condition is reduced to the simple case where
$A_X$ and $A_N$ are intervals (and not, e.g., a union of intervals).

The BMI is dual in some sense to the Entropy-Power
Inequality (EPI), which lower bounds the entropy-power of the
sum of independent random variables. In [2], a matrix form
for the EPI was derived, leading to tight lower bounds on the
capacity of an additive noise channel with memory or with
intersymbol interference. In parallel to that, we derive in this
work a matrix form for the BMI, which enables a
tight estimate for $R_G$ in cases where linear transformations
are incorporated with coding ("shaping"), or when spectral
constraints upon the signals are given.


II.  Linear  Transformation  of  Shapes 
We first introduce the matrix form of the Minkowski sum. Let
$A = (A_1 \ldots A_n)$ be a (row) vector whose n components are
d-dimensional shapes. We define a linear transformation of
$A_1 \ldots A_n$ as

$$TA = \{T\mathbf{x} : x_i \in A_i \text{ for } i = 1 \ldots n\}, \tag{3}$$

where T is an m × n matrix. In particular, tA means scaling
the coordinates of A by the scalar t. Note that TA is an
md-dimensional shape. Denote the volumes of the shapes by
$\mu(A_i) = \mu_i$, $i = 1 \ldots n$. Following simple laws of integration,
the md-dimensional volume of TA in the particular case m =
n is $\mu(TA) = |T|^d \cdot \mu(A) = |T|^d \cdot \prod_{i=1}^{n} \mu_i$, where $|\cdot|$ denotes
the absolute value of the determinant. For the general case,
we suggest the following matrix generalization of the BMI:

Theorem 1 (Matrix-BMI): Let $\tilde{A} = (\tilde{A}_1 \ldots \tilde{A}_n)$ be a
vector of d-dimensional cubes whose edges parallel the axes,
and whose volumes are the same as those of $A_1 \ldots A_n$, i.e., $\mu(\tilde{A}_i) =
\mu_i$, $i = 1 \ldots n$. Then

$$\mu(TA)^{1/d} \ge \mu(T\tilde{A})^{1/d} = \sum_{i=1}^{\binom{n}{m}} |\tilde{T}_i| \tag{4}$$

where $\tilde{T} = T \cdot L$, L is an n × n diagonal matrix whose diagonal
elements are $\mu_1^{1/d}, \ldots, \mu_n^{1/d}$ (the edges' lengths of the cubes
$\tilde{A}_1 \ldots \tilde{A}_n$), and $\{\tilde{T}_i\}_{i=1}^{\binom{n}{m}}$ is the set of all possible
m × m sub-matrices of $\tilde{T}$, obtained by choosing m out of the
n columns of $\tilde{T}$.

For m = 1, (4) reduces to $\mu\left(\sum_{i=1}^{n} t_i A_i\right)^{1/d} \ge
\sum_{i=1}^{n} |t_i| \mu_i^{1/d}$, i.e., to the regular BMI (2). Equality in (4)
holds in each one (or in a mixture) of the following cases: if
$A_1 \ldots A_n$ are cubes whose faces parallel each other; if (after
removing the all zero columns of T, if any) m = n; or if T
does not have a full row rank, in which case $\mu(TA) = 0$.
Theorem 1 is proved via a double induction over the dimensions
of T, using a conditional form of the BMI, analogously to the
proof of the matrix-EPI in [2].
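
The right-hand side of the matrix-BMI is easy to evaluate numerically. The sketch below assumes the bound reads $\mu(TA)^{1/d} \ge \sum_i |\tilde{T}_i|$ with $\tilde{T} = T L$ and $L = \mathrm{diag}(\mu_1^{1/d}, \ldots, \mu_n^{1/d})$, and sums the absolute determinants of all m × m column sub-matrices. The function name is illustrative.

```python
import itertools

import numpy as np


def matrix_bmi_lower_bound(T, volumes, d):
    """Sum of |det| over all m-column sub-matrices of T @ diag(mu_i^(1/d)).

    T is an m x n matrix with m <= n, volumes holds the n shape volumes
    mu_i, and d is the dimension of each shape. The returned value is a
    lower bound on mu(TA)^(1/d) under the assumed form of the theorem.
    """
    T = np.atleast_2d(np.asarray(T, dtype=float))
    m, n = T.shape
    edges = np.asarray(volumes, dtype=float) ** (1.0 / d)
    scaled = T @ np.diag(edges)
    return sum(
        abs(np.linalg.det(scaled[:, list(cols)]))
        for cols in itertools.combinations(range(n), m)
    )


# m = 1: reduces to sum_i |t_i| * mu_i^(1/d), the regular BMI.
row_case = matrix_bmi_lower_bound([[2.0, -3.0]], [8.0, 27.0], d=3)

# m = n: equals |det T| * prod_i mu_i^(1/d), the exact volume formula.
square_case = matrix_bmi_lower_bound([[1.0, 2.0], [3.0, 4.0]], [4.0, 9.0], d=2)
```

The two calls exercise the special cases discussed above: with cube edges 2 and 3, the row case gives 2·2 + 3·3 and the square case gives |det T| · 2 · 3.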
Abstract - A limitation to wavelet design is the inability
to construct orthonormal wavelets that match
or are "tuned" to a desired signal. This paper develops
a technique for constructing an orthonormal
wavelet that is optimized in the least squares sense,
and whose associated scaling function generates an
orthonormal multiresolution analysis (OMRA).

I.  Introduction 

Most  applications  of  orthonormal  multiresolution  analyses 
(OMRA)  use  either  Daubechies’,  Meyer’s,  or  Lemarie’s 
wavelets  [1,  2,  3].  However,  it  would  be  best  if  the  wavelet 
matched  the  signal  of  interest.  This  paper  presents  a  tech¬ 
nique  for  generating  an  OMRA  with  a  wavelet  that  is  matched 
in  the  least  squares  sense  to  a  signal  of  interest  by  first  devel¬ 
oping  a  method  for  constructing  the  scaling  function  from  the 
wavelet  and  second,  giving  the  conditions  on  the  wavelet  that 
guarantee  an  OMRA. 

II. Multiresolution Decomposition

Mallat [1] showed that the discrete wavelet transform can be
used to generate an orthonormal multiresolution decomposition
of a discrete signal consisting of a series of detail functions
and a residual low resolution approximation of the original
signal. The decomposition is done by convolving the original
sequence with a pair of quadrature mirror filters, h (low pass)
and g (high pass). In order to perfectly reconstruct the original
signal from the detail functions and the residual approximation,
the following must be true of the Fourier spectrum
magnitudes of h and g:

$$|H(\omega)|^2 + |G(\omega)|^2 = 1 \tag{1}$$

Cancellation of any aliasing is guaranteed by setting
$g_k = (-1)^k h_{1-k}$. The filters, h and g, are related to
the mother wavelet, $\psi(x)$, and scaling function, $\phi(x)$,
by their 2-scale relations [2], $\psi(x) = 2 \sum_k g_k \phi(2x - k)$ and
$\phi(x) = 2 \sum_k h_k \phi(2x - k)$, or in the frequency domain by

$$\Psi(\omega) = G\left(\frac{\omega}{2}\right) \Phi\left(\frac{\omega}{2}\right), \qquad \Phi(\omega) = H\left(\frac{\omega}{2}\right) \Phi\left(\frac{\omega}{2}\right) \tag{2}$$

III. Constructing $\Phi$ from $\Psi$

A recursive equation for finding $\Phi(\omega)$ from $\Psi(\omega)$ can be found
by taking the magnitude squared of the equations in (2),
adding them and substituting equation (1), giving

$$|\Phi(\omega)|^2 = |\Phi(2\omega)|^2 + |\Psi(2\omega)|^2 \tag{3}$$

Substituting $\omega = \pi k$, then $\omega = \pi k/2$ and so on, leads to the
following closed form solution.

$$\left|\Phi\left(\frac{\pi k}{2^\ell}\right)\right|^2 = \sum_{n=0}^{\ell} \left|\Psi\left(\frac{2^{n+1} \pi k}{2^\ell}\right)\right|^2, \quad k \ne 0, \tag{4}$$

and as in Mallat [1], $\Phi(0) = 1$. So, given any known wavelet,
its corresponding scaling function can be found directly from
equation (4).
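
The recursion (3) and its unrolled form can be sanity-checked on a wavelet whose spectra are known in closed form. The sketch below uses the Haar pair, for which $|\Phi(\omega)|^2 = (\sin(\omega/2)/(\omega/2))^2$ and $|\Psi(\omega)|^2 = \sin^4(\omega/4)/(\omega/4)^2$. The choice of the Haar wavelet and the function names are assumptions made for illustration. The unrolled sum relies on $\Phi(2\pi k) = 0$ for nonzero integer k, which holds for the Haar scaling function.

```python
import math


def haar_phi_sq(w):
    """Squared magnitude of the Haar scaling function spectrum."""
    if w == 0.0:
        return 1.0
    return (math.sin(w / 2.0) / (w / 2.0)) ** 2


def haar_psi_sq(w):
    """Squared magnitude of the Haar wavelet spectrum."""
    if w == 0.0:
        return 0.0
    return (math.sin(w / 4.0) ** 2 / (w / 4.0)) ** 2


def recursion_gap(w):
    """|Phi(w)|^2 - (|Phi(2w)|^2 + |Psi(2w)|^2); zero if (3) holds."""
    return haar_phi_sq(w) - (haar_phi_sq(2.0 * w) + haar_psi_sq(2.0 * w))


def unrolled_sum(k, level):
    """Sum of |Psi(2^n w)|^2 for n = 1..level+1 at w = pi * k / 2^level."""
    w = math.pi * k / 2 ** level
    return sum(haar_psi_sq(2 ** n * w) for n in range(1, level + 2))


gap = recursion_gap(1.0)
phi_from_psi = unrolled_sum(3, 4)
phi_direct = haar_phi_sq(math.pi * 3 / 2 ** 4)
```

Here `phi_from_psi` rebuilds the scaling function magnitude at $\omega = 3\pi/16$ purely from wavelet samples, which is the idea behind constructing $\Phi$ from $\Psi$.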

IV.  Guaranteeing  Orthonormality 

Given that $g_k = (-1)^k h_{1-k}$ and $\Phi(0) = 1$, the multiresolution
generated by $\phi(x)$ and related to $\psi(x)$ is orthonormal [1, 2]
if $\langle \phi(x), \phi(x - k) \rangle = \delta(k)$, or in the frequency domain,
$\sum_{m=-\infty}^{\infty} |\Phi(\omega + 2\pi m)|^2 = 1$. Applying this condition to
equation (4) and letting $\Delta\omega = \pi/2^\ell$ gives the following condition
on $\Psi(\omega)$ that will guarantee an orthonormal multiresolution
analysis.

$$\sum_{m=-\infty}^{\infty} \sum_{n=0}^{\ell} \left|\Psi\left(2^{n+1} (k \Delta\omega + 2\pi m)\right)\right|^2 = 1 \tag{5}$$

Any  wavelet  that  satisfies  condition  (5)  can  be  used  in  equa¬ 
tion  (4)  to  generate  an  orthonormal  scaling  function  that,  in 
turn,  generates  an  orthonormal  multiresolution  analysis. 

V.  Finding  Matched  Wavelets 

Assume  in  equation  (5)  that  'I'(w)  is  bandlimited  to  n  -  Kl  < 

|w|  <  TT  ■  Ku-  Then  for  each  value  of  £  =  0, 1 . N  where 

$\Delta\omega = \pi/2^{\ell}$ is the sample spacing chosen for $\Phi(k\Delta\omega)$, a set
of M equality constraints on $\Psi(k\Delta\omega)$ can be derived with the
following form

M  ,  .  ,2 

i=l 

where $a_{ik} \in \{0, 1\}$. Given $W(k\Delta\omega)$ as the desired signal
spectrum, the equality constraints in (6) along with the inequality
constraints $0 \le |\Psi(k\pi/2^{\ell})|^2 \le 1$ can be solved using
nonlinear programming techniques in which the objective function
$f = \sum_k (|W(k\Delta\omega)| - |\Psi(k\Delta\omega)|)^2$ is minimized. The result is a
wavelet spectrum that satisfies the conditions for orthonormality
and is matched to the desired spectrum, $W(k\Delta\omega)$. Since
the resultant wavelet spectrum is magnitude only, the wavelet
is symmetrical.
Abstract - This paper proposes a new coding strategy
by which a desired quality of the reproduced signal can
be guaranteed at the minimum cost in coding rate.

I.  Introduction 

In [1], we introduced the discrete orthonormal wavelet transform
(DOWT) into ECG data compression and obtained
compression ratios (CR) from 13.5:1 to 22.9:1
with corresponding percent rms differences (PRD) between
5.5% and 13.3%. Although this is a dramatic improvement
in compression performance over traditional ECG
compression schemes, both our previously proposed ECG
coding system and other lossy compression schemes leave
one problem unresolved: the desired quality of the reproduced
signal can hardly be guaranteed at the minimum cost in
coding rate. For ECG compression this problem is crucial,
because compression is useless if the quality of the
reproduced signal cannot be guaranteed.
The current solution is to sacrifice coding rate
in order to maintain the quality of the reconstructed signal,
which is clearly inefficient. In this paper, we
propose a DOWT-based compression scheme by which a desired
PRD between the original ECG signal and the reproduced signal
can be guaranteed at the minimum cost in coding rate.

II.  Coding  Strategy 

The DOWT-based ECG coding system is a hierarchical structure
composed of three parts: the DOWT unit, the quantizing
unit and the entropy coding unit. In a J-layer coding
system of this kind, an input discrete signal $a_0$ is progressively
decomposed into a set of sub-signals $\{a_J, (d_j)_{1 \le j \le J}\}$, where $a_J$ is
the lowest frequency sub-signal and $(d_j)_{1 \le j \le J}$ are the
differential details at different frequencies. These sub-signals are
quantized, entropy-encoded and transmitted to the receiver.
At the receiving side, the quantized sub-signals $\{a'_J, (d'_j)_{1 \le j \le J}\}$
are used to reconstruct the original signal. Let $\epsilon_j$ and $\bar{\epsilon}_J$ denote
the quantization MSEs of $d_j$ and $a_J$, respectively. According
to [2], the reconstruction MSE between $a_0$ and its reproduction
is given by

$\gamma = \bar{\epsilon}_J + \sum_{j=1}^{J} \epsilon_j.$  (1)

Based on Eq. (1), the problem of guaranteeing a desired
reconstruction MSE $\gamma_0$ at the minimum cost in coding rate can
be formulated as

j 

minimize  /i(7o)  =  (2) 

j=i 

subject to $\gamma = \gamma_0$,  (3)

where $h(\gamma_0)$ is the output, $\bar{h}_J$ and $h_j$ are the entropies of $a_J$ and
$d_j$ after quantization, respectively, and $\gamma$ is the reconstruction
MSE between $a_0$ and its reproduction given by Eq. (1).


The following optimum solution can be obtained by using the
Lagrange multiplier method

and  £^  =  ^70.  (4) 

Therefore, as soon as the quantization MSEs determined by
Eq. (4) are achieved, the desired reconstruction MSE $\gamma_0$ (or
the corresponding PRD) can be obtained.

How to achieve the desired quantization MSEs determined
by Eq. (4) is another key point in realizing our coding
strategy. In what follows, we propose an adaptive quantizer
by which the desired quantization MSE can actually be achieved.
For a uniform quantizer, the quantization MSE is given by
$\epsilon_i = K \Delta_i^2$, where K is a constant factor and $\Delta_i$ is the
quantization step-size. It is easy to see that, by adjusting the
step-size $\Delta_i$, the desired quantization MSE can be achieved.
In more detail, for an expected quantization MSE, say $\epsilon_0$, we
can randomly choose an initial step-size $\Delta_0$ to do the
quantization, which yields an actual quantization MSE $\hat{\epsilon}_0$. We
compare it with the expected $\epsilon_0$; if a given precision is not
satisfied, we replace $\Delta_0$ with $\sqrt{\epsilon_0/\hat{\epsilon}_0}\,\Delta_0$ and repeat the process
until a satisfactory precision is reached. Experiments have
shown that convergence is usually reached within about 3
iterations.
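
The step-size adaptation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the mid-tread rounding quantizer, the 1% relative tolerance, the iteration cap and the function names are all hypothetical choices.

```python
import math
import random


def quantize(x, step):
    """Uniform quantizer: round x to the nearest multiple of step."""
    return step * math.floor(x / step + 0.5)


def quantization_mse(samples, step):
    """Mean squared error between the samples and their quantized values."""
    return sum((x - quantize(x, step)) ** 2 for x in samples) / len(samples)


def fit_step(samples, target_mse, step=1.0, rel_tol=0.01, max_iter=50):
    """Rescale the step by sqrt(target / actual) until the MSE is within rel_tol.

    Returns (step, achieved_mse, iterations_used).
    """
    for iteration in range(1, max_iter + 1):
        mse = quantization_mse(samples, step)
        if abs(mse - target_mse) <= rel_tol * target_mse:
            return step, mse, iteration
        # Since the MSE grows roughly as step^2, this update aims at the target.
        step *= math.sqrt(target_mse / mse)
    return step, quantization_mse(samples, step), max_iter


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(10000)]  # stand-in for a sub-signal
    print(fit_step(data, target_mse=0.01))
```

Because the update assumes the MSE is proportional to the square of the step-size, a single rescaling already lands close to the target whenever that proportionality holds well.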

III.  Experiments 

We have tested our proposed coding scheme at different desired
PRDs. The ECG data is taken from the MIT-BIH Arrhythmia
Database, Record 200. The experimental results are
shown in Table 1. The reconstructed ECGs were evaluated by
cardiologists and appear clinically acceptable even at a CR
as high as 22.7:1.

IV.  Conclusion 

In this paper, we proposed a new coding strategy by which a
desired quality (PRD) of the reproduced signal can be guaranteed
at the minimum cost in coding rate. The idea was successfully
applied to the DOWT-based coding system for ECG
compression.
Desired PRD (%)    7         9         13
Actual PRD (%)     6.7       8.9       13.2
CR                 12.4:1    16.0:1    22.7:1

Table 1: Summary of the compression performance
Abstract - An adaptive Markov model with three states
for the mobile communication channel is studied and
simulated. The error sequence describing the long burst
error characteristics of the model channel is generated on
a computer based on the model. A test method using a
threshold technique is given to verify the accuracy of the
adaptive channel model.


I.  INTRODUCTION

Modelling the mobile communication channel is a prerequisite
for improving the channel error performance by error-control
techniques. Setting up a statistical model of the mobile
communication channel that is both universal and accurate
is of great practical interest. In this paper, we apply a Markov
model with three states to set up an adaptive statistical
model for the mobile communication channel based on a number
of field test curves. The model parameters, i.e., the elements
of the channel transition probability matrix P, are expressed
as functions of the average carrier-to-noise ratio (C/N).
This is because, among the factors which dominate the burst
error characteristics of the mobile channel, C/N plays an
important role. We accomplish the following work:

II.  SETTING UP THE ADAPTIVE CHANNEL MODEL

First, we use a simple partitioned Markov model with three
states as the statistical model describing the long burst
error characteristics in the mobile communication channel.
The parameters of the state transition probability matrix
$P = \{p_{ij}\}$ $(i, j = 1, 2, 3)$ of the model and the burst interval
length distribution $G(m) = A_1 e^{a_1 m} + A_2 e^{a_2 m}$
of the channel error sequence have the following relations:

$p_{11} = e^{a_1}$, $p_{22} = e^{a_2}$, $p_{31} = A_1 e^{a_1}$, $p_{32} = A_2 e^{a_2}$,

$p_{13} = 1 - p_{11}$, $p_{23} = 1 - p_{22}$, $p_{33} = 1 - p_{31} - p_{32}$.

Then, according to a group of field test curves under
different values of C/N, measured in a typical mobile
propagation environment, we set up the adaptive
three-state Markov model, whose parameters are functions
of C/N, by nonlinear least-squares curve fitting.

III.  GENERATING THE ERROR SEQUENCES BASED ON THE MODEL SET UP

According to the parameters of the adaptive channel model
set up above, we generate an error sequence $\{e_i\}$, which
describes the long burst error characteristics of the mobile
channel under different values of C/N, on a computer by the
following method:

1)  Set $i = 0$ and assume an initial state $s_0$ ($s_0$ = 1, 2 or 3);

2)  Generate a pseudo-random number $r_i$ uniformly distributed
in the interval [0,1] by the hybrid congruence method;

3)  Determine the value of $e_i$ under the current state, and
determine the next state according to:

a)  If $s_i = 3$, $e_i = 1$, and

$s_{i+1} = 1$ when $r_i < p_{31}$,
$s_{i+1} = 2$ when $p_{31} \le r_i < p_{31} + p_{32}$,
$s_{i+1} = 3$ when $p_{31} + p_{32} \le r_i$;

b)  If $s_i = j \ne 3$, $e_i = 0$, and

$s_{i+1} = j$ when $r_i < p_{jj}$,
$s_{i+1} = 3$ when $r_i \ge p_{jj}$;

4)  Set $i = i + 1$ and return to 2).

After generating the error sequence, we estimate its
performance by computing its burst interval length
distribution $G(m)'$ and average bit error rate $P_e'$
for the corresponding value of C/N.

IV.  MODEL ACCURACY TEST AND CONCLUSIONS

To verify the accuracy of the adaptive channel model
set up above, we present a test method using a threshold
technique. Its fundamental principle is as follows:

Since the burst error characteristics of the mobile
communication channel are regarded as following a Rayleigh
distribution (here we only consider the case of severe fading),
we can verify the accuracy of the adaptive channel model
by checking, for each value of C/N, whether the burst
error length distribution in the generated $\{e_i\}$ accords with a
Rayleigh distribution.

We first generate a random number with Rayleigh
distribution (denoted by $s_i$) from a random number uniformly
distributed in the interval [0,1] (denoted by $r_i$) according to

$s_i = u\sqrt{-2\ln r_i}$  (1)

where u is the mean value of the Rayleigh distribution, and
thereby generate a random sequence $\{s_i\}$ with Rayleigh
distribution. Then we set up a threshold B as follows

(2) 

where $P_e$ is the average bit error rate of a random sequence. The
threshold values for various values of C/N are obtained by
letting $P_e$ be equal to the $P_e'$ under the corresponding value of
C/N computed in the above performance estimation.

According to the above threshold value for each value of C/N,
we quantize $\{s_i\}$ into a (0,1) sequence with Rayleigh
distribution by the following threshold comparison method, i.e.,

$e_i = 0$, when $s_i < B$

$e_i = 1$, when $s_i \ge B$
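
The reference sequence construction can be sketched as below. The text of equation (2) is not recoverable here, so the threshold used in the code is an assumption: under the comparison rule above, $P(s_i \ge B) = \exp(-B^2/(2u^2))$ for samples drawn by (1), which gives $B = u\sqrt{-2\ln P_e}$. The function names and parameter values are likewise hypothetical.

```python
import math
import random


def rayleigh_sample(u, r):
    """Map a uniform number r in (0, 1] to a Rayleigh-distributed value, as in (1)."""
    return u * math.sqrt(-2.0 * math.log(r))


def threshold_for_error_rate(u, p_e):
    """Assumed threshold: choose B so that P(s >= B) equals the target p_e."""
    return u * math.sqrt(-2.0 * math.log(p_e))


def reference_sequence(u, p_e, n, seed=0):
    """Generate n Rayleigh samples and quantize them to 0/1 by comparison with B."""
    rng = random.Random(seed)
    b = threshold_for_error_rate(u, p_e)
    # 1.0 - rng.random() lies in (0, 1], which keeps log() finite.
    return [1 if rayleigh_sample(u, 1.0 - rng.random()) >= b else 0 for _ in range(n)]


if __name__ == "__main__":
    seq = reference_sequence(u=1.0, p_e=0.05, n=100000)
    print("empirical bit error rate:", sum(seq) / len(seq))
```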

For the (0,1) sequence with Rayleigh distribution, we
estimate its performance by computing its burst interval
length distribution $G(m)''$ and average bit error rate $P_e''$. Then,
comparing $G(m)''$ and $P_e''$ with $G(m)'$ and $P_e'$ respectively
under the same value of C/N, we observe that both are very
close for each value of C/N, so the adaptive channel model
is accurate to a great extent. Besides, compared with the
conventional general partitioned Markov model, which has
a number of error states, our model is more practical
since it is easy to compute. In short, it is a feasible
scheme for optimizing the mobile communication channel
model.
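
The generation procedure of Section III can be sketched in a few lines. The parameter values below are made up for illustration (the paper fits them to field test curves as functions of C/N), Python's built-in generator stands in for the hybrid congruence method, and reading G(m) as the probability that at least m error-free bits follow an error is an assumption of this sketch.

```python
import random


def generate_errors(p11, p22, p31, p32, n, seed=0, state=1):
    """Generate n bits of the error sequence from the three-state model.

    State 3 is the error state (e = 1); states 1 and 2 are error-free (e = 0).
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n):
        r = rng.random()
        if state == 3:
            errors.append(1)
            if r < p31:
                state = 1
            elif r < p31 + p32:
                state = 2
            # otherwise remain in state 3
        else:
            errors.append(0)
            stay = p11 if state == 1 else p22
            if r >= stay:
                state = 3
    return errors


def gap_survival(errors, m):
    """Fraction of errors that are followed by at least m error-free bits."""
    hits = total = 0
    run = 0  # error-free bits directly after the current position
    for i in range(len(errors) - 1, -1, -1):
        if errors[i] == 1:
            if i + m < len(errors):  # skip errors too close to the end
                total += 1
                hits += run >= m
            run = 0
        else:
            run += 1
    return hits / total


if __name__ == "__main__":
    p11, p22, p31, p32 = 0.99, 0.9, 0.3, 0.4  # hypothetical parameters
    seq = generate_errors(p11, p22, p31, p32, n=200000)
    model = p31 * p11 ** 4 + p32 * p22 ** 4
    print("bit error rate:", sum(seq) / len(seq))
    print("G(5) empirical:", gap_survival(seq, 5), "model:", model)
```

With the relations of Section II, $p_{31} p_{11}^{m-1} + p_{32} p_{22}^{m-1}$ equals $A_1 e^{a_1 m} + A_2 e^{a_2 m}$, which is why the empirical gap statistic is compared against that expression.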
Abstract - Digital data transmission signals may be considered
as a specific stochastic process controlled by a
Markov chain. After briefly presenting and evaluating
the power density spectra (PDS) of such processes, we turn
to one of our major concerns, the computational effort. By
a special grouping of the employed signal elements and
a corresponding partitioning of the controlling transition
matrix, the formula for the desired PDS can be simplified to a
Euclidean vector norm expression. Several PDS graphs
illustrate the relevance of such an analysis for evaluating or
designing real transmission systems.

1.  Markov Chain Models

The signals usually employed for digital data transmission can
be considered as running sequences of discrete, specially
shaped signal elements ([1]). Coding or modulating devices will
generally introduce some memory into the system, so the next
signal element to be sent may depend on one or more of the
previously sent elements. Therefore Markov chain theory ([2])
provides very effective tools for exploring such signals (e.g. [3], [4]).

In detail we consider an isochronological stationary Markov
process (pace width = 1·T) that is internally controlled by a
homogeneous Markov chain. For the chain itself we denote the transition
matrix $P = (p_{k,l})$ and the absolute state distribution $q = (q_1 .. q_K)$,
either one being independent of the observation time. To avoid
tedious discussion here, we assume the Markov chain to be ergodic
([2]), i.e. $P^n \Rightarrow (1..1)^T \cdot (q_1 .. q_K)$ does exist and all $q_k$ are > 0.

For the external process realization we consider signal elements
that are individually assigned to the internal states. For convenience
we specify them here in the frequency domain and denote them as
$S_k(\omega)$. In any case the equality $S_k(\omega) = S_l(\omega)$ of elements assigned
to different states is compatible with our model.

Modeling the system comprises first of all the decision on the
abstract Markov states, the real signal elements and the relationship
between them. Eventually the codulator operation and the statistics
of the source data must be introduced into the state transition
matrix. In an advanced model one may think about grouping the
external signal elements in special classes such that some
partitioning of the transition matrix becomes obtainable, which can
finally lessen the computational effort considerably.

2.  The  Power  Density  Spectra  of  Markov  Processes 

Here we are interested in evaluating the power density spectra
(PDS) of our transmission signals. According to Markov chain
theory a basic formula can be derived (e.g. [3], [4] et al.) which
may then be rewritten in a Hermitian form using $I = \mathrm{Diag}(1..1)$,
$Q = \mathrm{Diag}(q_1 .. q_K)$, $S = (S_1(\omega) ... S_K(\omega))^T$ and $z = \exp(-j\omega T)$.

$\mathrm{PDS}_{cont} = S^{*} \cdot ((I - zP)^{*})^{-1} \cdot (Q - P^T Q P) \cdot (I - zP)^{-1} \cdot S$  (1)

As a first application of this formula, we look at the maximum
entropy process, which is typically equipped with equally
distributed state probabilities $q = 1/K \cdot (1..1)$ or $Q = 1/K \cdot I$,
respectively, and the transition matrix $P = (1..1)^T \cdot q$. In this special case
the symmetric factor in the center of the matrix product (1) reads in
particular as $(I - (1..1)^T \cdot q) \cdot K^{-1}$. Now it is important to
appreciate that the last symmetric matrix is idempotent ([2]). Thus it
finally turns out that the former equation (1) is equivalent
to the following Euclidean vector norm expression

$\mathrm{PDS}_{cont} = \| (I - (1..1)^T \cdot q) \cdot S \|^2 \cdot 1/K$  (2)

For the more general statistically independent processes where
the Markov states are distributed as $q = (q_1 .. q_K)$, an analogous result is
easily verifiable:

$\mathrm{PDS}_{cont} = \| Q^{1/2} \cdot (I - (1..1)^T \cdot q) \cdot S \|^2$  (3)
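
For statistically independent states the norm expression can be checked numerically against the familiar variance form $\sum_k q_k |S_k|^2 - |\sum_k q_k S_k|^2$. The sketch assumes that equation (3) has the form $\| Q^{1/2} (I - (1..1)^T q) S \|^2$ and that $P = (1..1)^T q$; the signal element values and function names are invented for the example.

```python
import cmath


def pds_variance_form(q, s):
    """sum_k q_k |S_k|^2 - |sum_k q_k S_k|^2 at one fixed frequency."""
    mean = sum(qk * sk for qk, sk in zip(q, s))
    return sum(qk * abs(sk) ** 2 for qk, sk in zip(q, s)) - abs(mean) ** 2


def pds_norm_form(q, s):
    """|| Q^(1/2) (I - 1 q) S ||^2, the assumed form of equation (3)."""
    mean = sum(qk * sk for qk, sk in zip(q, s))
    centered = [sk - mean for sk in s]  # (I - (1..1)^T q) S
    return sum(qk * abs(ck) ** 2 for qk, ck in zip(q, centered))


if __name__ == "__main__":
    q = [0.5, 0.3, 0.2]                      # hypothetical state probabilities
    s = [1 + 0j, cmath.exp(0.7j), -0.5j]     # hypothetical S_k at one frequency
    print(pds_variance_form(q, s), pds_norm_form(q, s))
```

The norm form needs only one pass over the centered vector and cannot become negative through rounding, which is one practical reason to prefer it.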

Eventually we may explore L-valued FSK schemes with continuous
phase characteristics. It is most profitable to organize the
signal elements in a 2-level hierarchy: at first we define L classes
of elements with respect to their frequency parameter $f_k$, and then we
identify all possible phase values which occur at the start of each
element and denote their number by M.
Therefore the total number of needed Markov states adds up to K = M·L.

The  proposed  structuring  of  the  signal  elements  immediately 
motivates  a  conforming  partitioning  of  the  transition  matrix  using 
block  matrices  ([5]).  Thus  for  statistically  independent  source  data, 
one  can  formally  establish  £  =  (Cm“'-.  Gh'  )  (qr^i  •• 
where C stands for a cyclic matrix, the exponents refer to the phase
differences between the beginning and end of the various signal
elements of frequency $f_k$, and the subscripts indicate the matrix
dimension. Using the notion of a Kronecker product [5] one can

still  rewrite  £  =  (G,i“'..  i^‘‘*')’^’(qi"qL)®iM- 

Following the procedure of the previous examples we finally
arrive again at a Euclidean vector norm expression for the PDS.

Although this expression could be further evaluated in general
form, we will not discuss this issue in any more detail here.

3.  Applications  and  Future  Work 

In the project "DIG-SPEC - Power Spectra of Digital Data
Signals", the PDS formulas for several modulated carrier as well as
baseband coded signals were evaluated. A program package for
computation using these formulas and displaying the graphs on a
standard VDU is already available. This program system will support
investigations of the frequency characteristics and bandwidth
requirements of existing or new codulation schemes.

In theory we will carry out further studies of Markov processes
employing transition matrices which have block structure and are
particularly amenable to treatment by Kronecker products.
Abstract - The goal of this work is to analyze the advantages
of the recently introduced entropy-constrained code-excited
linear predictive (EC-CELP) quantization [1]. The analysis
is at low rates and in comparison with other EC quantization
schemes. Based on the N-th order rate-distortion function (RDF),
EC quantization theory, and empirical methods, the RDF memory
gain and empirical space-filling gain (dimensionality N)
at low bit rates are defined and calculated. These gains
categorize and help us analyze and compare the available coding
gains for various EC coders for a given rate and delay (N).

EC-CELP and other EC quantizers. EC-CELP addresses
the problems associated with high-quality (near rate-distortion
bound) quantization of sources with memory, operating at low
bit rates, with minimal delay, and low complexity. The objectives
are met by combining the advantages of VQ, predictive
coding (PC), and analysis-by-synthesis with the merits of
closed-loop entropy-constrained (EC) codebook design (details in [1]).

Other EC schemes include EC scalar quantization (ECSQ)
and vector quantization (ECVQ). For sources with memory,
configurations which use a suitable memory removal technique,
such as transform coding (TC) and PC, result in more efficient
combination techniques with lower delay (dimension N).
They include EC block transform quantization (EC-BTQ),
EC-DPCM, and EC predictive VQ (EC-PVQ) [1]. All of the
above EC coders (except EC-BTQ) can be shown to be special
cases of EC-CELP (hence EC-CELP's better performance). We
use a stationary first-order Gauss-Markov (GM(1)) source for
our comparisons and analysis. Fig. 1 shows the performance
advantage of EC-CELP over other EC schemes for a given N.
EC-BTQ results for a = 0.9 are from a previously published
work. For other values of a, the values are predicted from ECVQ.

Coding gains analysis. This analysis is based on RDF values
($\mathrm{SNR}_{RDF}$), EC quantization theory, and empirical methods.
Using a modified analysis scheme of [2], for low rates and
general EC coders we define two coding gains over basic ECSQ:
the RDF memory gain and the empirical space-filling gain. For a given
dimension N and rate R, using a GM(1) source we have

R)  =  SNR°“^^’(iV,  R)  -  SNR^o'f  °*'”"“(iV,R) 
A;im.,°‘"”“'’(iV,  R)  =  SNR-e),"vq°*“’"‘“'(A',  R)  -  SNR^cscf*"’"*” 

For the low rate region the N-th order RDF is obtained
parametrically. The top and bottom graphs in Fig. 2 show the
memory and filling gains. For EC-CELP, the ideal PC effective
N is high and hence should nearly provide the high N
gains (top graph). The middle graphs show predicted memory
gain for other coders. The analysis-by-synthesis feature of
EC-CELP in effect provides for intra-block PC gain (EC-CELP
advantage over EC-PVQ). As the high-R ideal PC estimated
PVQ gains in Fig. 2 show, loss of memory gain due to lack of
intra-block PC gain could be substantial. An EC coder SNR
is approximately the $\mathrm{SNR}_{ECSQ}$ plus the coder memory and filling
gains. The combined memory and filling gains over ECSQ of
EC-CELP for a given N and R are the highest. Hence it yields
the highest SNR (Fig. 1). The efficient memory removal in
EC-CELP allows for the concentration of VQ on the remaining
memory redundancies (especially quantization) and the filling
gain. Also, since the resulting EC-CELP codebook size is not
high, the resulting EC-CELP complexity is also relatively low.

* This research was supported in part by a grant from the Canadian
Institute for Telecommunications Research (CITR) under the NCE
program of the Government of Canada.

¹ Electrical Engineering, McGill University, 3480 University Street,
Montreal, PQ, CANADA H3A 2A7

² INRS-Telecommunications, Université du Québec, 16 Place du
Commerce, Verdun, PQ, CANADA H3E 1H6
E-mail: {foodeei, eric}@INRS-Telecom.UQuebec.CA
[Fig. 1 legend: 1-D EC-CELP or EC-DPCM; 3 & 8-D EC-CELP (-, bottom to top); RDF (solid)]

Fig. 1 Performance advantage of EC-CELP.

[Fig. 2 legend: a = .99, .95, .9, .5, .2 (high N); a = 0.99]


Fig.  2  Top  graphs:  High  and  low  N  RDF  memory  gains 
for  Gauss-Markov  source  with  coefficient  a.  Middle  graphs: 
Analysis  to  show  a  comparison  of  theoretical  memory  gains 
between  VQ/TC,  PVQ,  and  CELP.  Bottom  graphs:  low  rate 
empirical  space  filling  gains  for  i.i.d  Gaussian  source. 
Abstract - A model is proposed in which the neuron serves as
an information channel. An application of the Shannon
information measures of entropy and mutual information, taken
together in the context of the proposed model, leads to the
Hopfield neuron model with a conditionalized Hebbian learning
rule and sigmoidal transfer characteristic.

1.  INTRODUCTION 

A  maximum  entropy  (ME)  formulation  is  shown  to 
provide  the  basic  functional  form  of  the  model  neuron  including 
synaptic  weights  and  a  sigmoidal  transfer  characteristic.  This 
formulation  required  the  assumption  of  a  set  of  measurement 
functions  which  are  in  turn  a  function  of  both  synaptic  inputs 
and  neuron  output.  Furthermore,  an  ME  formulation  requires 
the  specification  of  the  statistical  moments  of  the  selected 
measurement  functions  which  must  somehow  be  supplied  by  an 
unspecified source. An ME formulation is underconstrained in
the sense that the model neuron cannot find a uniquely
preferable set of moment constraints. Alternatively, a maximum
mutual  information  (MMI)  formulation  is  shown  to  be  fully 
constrained  in  this  regard  and  can  make  exclusive  use  of  locally 
available  information.  Solutions  take  the  form  of  the  Hopfield 
neuron  model  with  a  requirement  for  a  feedback  learning 
methodology  which  takes  the  form  of  Hebbian  learning  whereby 
the  synaptic  weights  are  only  modified  in  response  to  the 
conjunction  of  input  and  output  neural  events.  A  modification 
of  an  adaptation  equation  of  Oja  [1]  provides  for  an  algorithmic 
solution. 

II.  MAXIMUM ENTROPY (ME) FORMULATION

An  ME  formulation  yields  a  Boltzmann  distribution  for 
a  single  neuron  which  extracts  (measures)  certain  moments  from 
its  environment.  A  maximum  likelihood  decision  rule  results 
which  corresponds  to  that  of  a  deterministic  Hopfield  [2]  neuron 
model. A stochastic decision rule is also possible, which first
requires the computation of an evidence function that can then
be passed through a sigmoidal non-linearity. The described results
require the specification of the moments of a set of N
measurement functions by a nonspecific supervisor. This is
considered undesirable for the development of useful
computing structures constrained to use locally available
information only.

III.  MAXIMIZED  MUTUAL  INFORMATION  (MMI) 

Maximization of the mutual information between the
neuron vector input x and output y can be accomplished if an ME
distribution form is assumed for P(x, y). The objective is to find
the Lagrange set X ∈ A which maximizes the mutual information
between x and y. This requires finding the extremum of an


objective  function  J(X)  =  I(x;y;X)  +  over  some 

permissible $A \subset R^N$, where the additional constraint $X^T X = \gamma$ is
imposed. Without this constraint, an obvious extremum is X = 0,
in which case I(x;y;X) = 0. A derived Gibbs Mutual Information
Theorem states that the extrema of J(X) can be found by solving
a system of linear equations which lead to a conditionalized
principal component analysis of the neural input. This results
in a Hebbian learning rule analogous to biological models. This
learning rule attempts to simultaneously minimize the conditional
entropy of the output given the input, H(y | x), and maximize the
entropy of the output, H(y), such that P(y) = 1/2, implying that the
neuron output has maximum entropy H(y) for a one-bit channel.
An extremely simple numerical algorithm serves to implement
the developed strategy. Simulation results on simulated test data
verify the analytical derivations. These results indicate that the
model neuron automatically separates input vectors into two
equally probable classes based on degree of similarity. The
biological equivalent of an action potential is generated for the
preferred class.
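
The role of the two entropies can be illustrated with the identity I(x;y) = H(y) - H(y | x) for a neuron with a binary output. The sketch below is only an illustration of that identity; the discrete input alphabet, the example probabilities and the function names are assumptions, not part of the proposed model.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(p_x, p_fire_given_x):
    """I(x;y) = H(y) - H(y|x) for binary y, with p_fire_given_x[i] = P(y=1 | x_i)."""
    p_fire = sum(px * pf for px, pf in zip(p_x, p_fire_given_x))
    h_y = entropy([p_fire, 1 - p_fire])
    h_y_given_x = sum(px * entropy([pf, 1 - pf])
                      for px, pf in zip(p_x, p_fire_given_x))
    return h_y - h_y_given_x


if __name__ == "__main__":
    p_x = [0.25, 0.25, 0.25, 0.25]                       # four equally likely inputs
    print(mutual_information(p_x, [1, 1, 0, 0]))         # balanced, noiseless split
    print(mutual_information(p_x, [1, 0, 0, 0]))         # unbalanced split
    print(mutual_information(p_x, [0.9, 0.9, 0.1, 0.1])) # balanced but noisy
```

A noiseless split into two equally probable classes reaches the one-bit maximum; an unbalanced split lowers H(y), and a noisy response raises H(y | x), so both reduce the mutual information.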
Abstract - Sequential properties of the Viterbi algorithm
are studied based on a renewal sequence of
the most informative stopping times, which can be found
explicitly during the Viterbi recognition of the
most likely hidden Markov state sequence.

I.  Introduction 

The Viterbi algorithm (VA) [1] finds the most likely
state-sequence (MLSS) of a finite hidden Markov chain (HMC)
$\{h_t\}$ indirectly observed through a process $\{z_t\}$, $t = 0, \ldots, N$.
An optimal rule for the VA can be found by maximizing
the additive criterion $\ln P\{h_0^N, z_0^N\}$ by a dynamic
programming (DP) method [1]-[4]. Then the Viterbi recognition
of $h_0^N$, or the optimal segmentation of the observations
$z_0^N$, can be obtained by the backtracking $t = N - 1, \ldots, 0$:
$\hat{h}_t = k_{t+1}(\hat{h}_{t+1})$, where $\hat{h}_N = \arg\max_{h_N} d(h_N)$ and $d(\cdot)$ is the
corresponding additive functional for this DP problem.

The direct implementation of DP requires storing the values
of $\hat{h}_t$, which fills up a table K (m × N) with columns of
back pointers $k_t : H_t \to H_{t-1}$, $t = N, N-1, \ldots$, with
$H_N = H = \{0, 1, \ldots, m-1\}$. But if for some moment
s there exists $j \in H$ such that $k_{s+1}(h_{s+1}) = j$ for all
$h_t \in H_t = H$, $t > s$, then $h_s = j$ is called a special column (SC)
in the table K of optimal DP decisions [2], [3].

Then the moments at which the SCs appear are the most
informative stopping times (MISTs) for the Viterbi recognition
of HMS [4], because after they appear further observations
do not change the previous decisions of the VA.
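
A minimal sketch of the idea is given below: a log-domain Viterbi pass that records a special column whenever every back pointer at the next time step selects the same predecessor state. This follows one reading of the definition above and is not the authors' implementation; the function name, argument layout and the small two-state example are assumptions.

```python
import math


def viterbi_with_special_columns(log_init, log_trans, log_emit, observations):
    """Return (most likely state path, list of special columns (s, j)).

    (s, j) is recorded when all back pointers k_{s+1}(h) equal j, so the
    decision h_s = j no longer depends on later observations.
    """
    m = len(log_init)
    d = [log_init[h] + log_emit[h][observations[0]] for h in range(m)]
    back = []      # back[t - 1][h] = best predecessor of state h at time t
    special = []
    for t in range(1, len(observations)):
        z = observations[t]
        pointers, new_d = [], []
        for h in range(m):
            best = max(range(m), key=lambda g: d[g] + log_trans[g][h])
            pointers.append(best)
            new_d.append(d[best] + log_trans[best][h] + log_emit[h][z])
        back.append(pointers)
        if len(set(pointers)) == 1:
            special.append((t - 1, pointers[0]))
        d = new_d
    # Backtracking from the best final state.
    path = [max(range(m), key=lambda h: d[h])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, special


if __name__ == "__main__":
    lg = math.log
    init = [lg(0.5), lg(0.5)]                          # hypothetical two-state HMM
    trans = [[lg(0.9), lg(0.1)], [lg(0.2), lg(0.8)]]
    emit = [[lg(0.8), lg(0.2)], [lg(0.3), lg(0.7)]]
    obs = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
    print(viterbi_with_special_columns(init, trans, emit, obs))
```

At each recorded special column the decoded path up to that time could be released immediately, since decoding any longer observation sequence reproduces the same prefix.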


II.  Main  Results 

1.  The space of decisions for the VA has the same structure
as in sequential analysis: the regions of acceptance of
hypotheses correspond to the regions where the SCs appear,
and the region of continuation is located in the middle.
The bounds of these regions have a representation via
$\min_i \ln p_{ji}/p_{ii}$ [3], [4].

2.  For a HMC with two states and the matrix of transition
probabilities

$P = \begin{pmatrix} p & 1-p \\ 1-q & q \end{pmatrix}$

there are three types of back pointer decisions:

i)  Identical decisions: $p + q > 1$, $A = \ln((1-p)/q)$, $B = \ln(p/(1-q))$.
If $D_t^{10} = d_t(1) - d_t(0) < A$, then $k_{t+1}(h_{t+1}) = h_{t=\tau_i} = 0$,
$k(h_{\tau_i - s}) = 0$, $s = 1, \ldots, \tau_i - \tau_{i-1} - 1$. If $D_t^{10} > B$,
then $k_{t+1}(h_{t+1}) = h_{t=\tau_i} = 1$, $k(h_{\tau_i - s}) = 1$,
$s = 1, \ldots, \tau_i - \tau_{i-1} - 1$.

ii)  Alternate decisions: $p + q < 1$, $A = \ln(p/(1-q))$, $B = \ln((1-p)/q)$.
If $D_t^{10} < A$, then $k_{t+1}(h_{t+1}) = h_{t=\tau_i} = 0$, and

$k(h_{\tau_i - s}) = 1$ if $s = 2r - 1$, and $0$ if $s = 2r$.

If $D_t^{10} > B$, then $k_{t+1}(h_{t+1}) = h_{t=\tau_i} = 1$, and

$k(h_{\tau_i - s}) = 0$ if $s = 2r - 1$, and $1$ if $s = 2r$,

$r = 1, \ldots$, $\tau_i - s > \tau_{i-1}$.

iii)  Immediate decisions ($R_C = \emptyset$): $p + q = 1$, $A = B = 0$
(the underlying MC degenerates into independent trials
and the SCs appear at each observation).

Thus, (i) for m = 2 it is not necessary to store the intermediate
information of back pointers; (ii) one can get the same
Viterbi recognition for different HMMs; (iii) for $m \ge 3$ the
VA can be analyzed as an (m-1)-dimensional random walk on the
underlying Markov chain with m states.

3.  For an anticircle HMC with one ergodic class the SCs
appear infinitely often a.s., and the mean and variance of the
time at which the SCs appear can be estimated in many important
cases via analogues of Wald's identities for random
walks on a Markov chain.

4.  As in sequential analysis, one can estimate the error of
Viterbi recognition [4]. If $P_i\{\tau(A,B) < \infty\} = 1$, $i = 0, 1$,
and $\alpha_\tau(A,B) < 1$, $\beta_\tau(A,B) < 1$, then
$\ln(\alpha/(1-\beta)) \le A$, $B \le \ln((1-\alpha)/\beta)$.
But here, in duality to the sequential test of statistical
hypotheses, the constants A and B are given.

5. The duality between Wald's sequential analysis and the VA also allows us to represent classical sequential problems, such as testing of two simple hypotheses (TTSH) or change-point-distribution detection (CPDD), via the VA.

(i)  P  =  ^  ^  ~  2  1  ^  j  ,  1  >  e  >  0,  for  TTSH. 

When $\varepsilon \to 0$, the VA recognizes the true Markov state sequence, and therefore the true hypothesis, with great accuracy. In this case the bounds of the region of observations tend to $\pm\infty$ as $\varepsilon \to 0$, so the errors of the first and second kinds tend to 0.


(ii) P=^^^^  j,l  >p,e>0,forCPDD. 

(iii)  P  =  (  ^  ^^^Vfora  Periodical  chain. 


6.  The  renewal  properties  of  the  MIST  sequence  can  be 
used  for  the  regenerative  stochastic  simulation  for  the  VA  and 
estimation  of  unknown  parameters  of  a  HMM  by  a  segmental 
K-means  recognition. 
Abstract - A multiresolution model for Gauss
Markov  random  fields  (GMRF)  is  presented.  Based 
on  information  theoretic  measures,  techniques  are 
presented  to  estimate  the  GMRF  parameters  of  a  pro¬ 
cess  at  coarser  resolutions  from  the  parameters  at  fine 
resolution. 


I.  Introduction 

There  has  been  an  increasing  interest  in  using  statistical  tech¬ 
niques  for  modeling  and  processing  images  in  the  computer 
vision  community.  Most  of  the  research  has  been  restricted  to 
Markov  random  field  models,  rightly  so,  because  of  the  local 
statistical dependence of images. The main drawback of MRF
techniques  is  that  the  associated  optimization  schemes  are  it¬ 
erative  and  are  usually  computationally  expensive.  One  way 
to  reduce  the  computational  burden  is  to  use  multiresolution 
techniques  [1],  [2].  In  this  paper,  we  present  multiresolution 
models  for  Gauss  Markov  random  fields. 


II.  Multiresolution  Models 
Let $\Omega^{(0)} = \{(i,j) : 0 \le i \le M-1,\ 0 \le j \le M-1\}$ be a lattice on which a GMRF is defined. The superscript stands for the level in the image pyramid: $\Omega^{(0)}$ is the lattice at the fine resolution and $\Omega^{(k)}$ represents the lattice that is obtained by subsampling $k$ times. The elements of $\Omega^{(k)}$ are indexed by $a$, where $a = (a_1, a_2)$. Let $X^{(k)}$ represent a random vector, obtained by ordering the random variables on the two-dimensional lattice $\Omega^{(k)}$ through a row-wise scan. Let $X^{(0)}$ be modeled by a GMRF; then the joint probability density function of $X^{(0)}$ can be written as follows:

$$p^{(0)}(X^{(0)} = x) = \frac{1}{(2\pi)^{M^2/2} (\det \Sigma^{(0)})^{1/2}} \exp\left\{-\tfrac{1}{2}\, x^T [\Sigma^{(0)}]^{-1} x\right\},$$

where $\Sigma^{(0)}$ is the covariance matrix of $X^{(0)}$. Equivalently, the process $X^{(0)}$ can be written in terms of a non-causal interpolative representation with a neighborhood $\eta^{(0)}$:

$$X_a^{(0)} = \sum_{r \in \eta^{(0)}} \theta_r^{(0)} X_{a+r}^{(0)} + e_a^{(0)},$$

where $e_a^{(0)}$ is zero-mean, spatially correlated Gaussian noise with variance $\sigma^2$.

Hence a GMRF process can be completely characterized by the set of parameters $\{\Theta, \sigma^2\}$. It can be shown that GMRFs lose Markovianity under the subsampling resolution transformation. However, if lower resolution data are modeled by the exact non-Markov Gaussian measures, conventional optimization techniques based on Markov properties cannot be employed. We present two methods to estimate the parameters of

^This  work  was  supported  in  part  by  the  National  Science  Foun¬ 
dation  under  Grant  #ASC  9318183 


Markov approximations at coarser resolutions. Let $p^{(k)}(X^{(k)})$ be the non-Markov pdf at the $k$th resolution and $p_\theta^{(k)}(X^{(k)})$ be the family of Gauss Markov pdfs. Assuming a neighborhood $\eta^{(k)}$:

(1) a GMRF approximation can be obtained by minimizing $D[p^{(k)}(X^{(k)}) \,\|\, p_\theta^{(k)}(X^{(k)})]$, where $D(\cdot\|\cdot)$ is the Kullback-Leibler distance. It can be shown that this computation is very similar to conventional maximum likelihood estimation of the parameters, except that instead of sample covariances it uses covariances calculated with respect to the $p^{(k)}(X^{(k)})$ measure.

(2) a GMRF approximation can be obtained by minimizing $D[p^{(k)}(X_a^{(k)} \mid X_{a+r}^{(k)},\, r \in \eta^{(k)}) \,\|\, p_\theta^{(k)}(X_a^{(k)} \mid X_{a+r}^{(k)},\, r \in \eta^{(k)})]$. We call this the local conditional distribution invariance approximation. It can be shown that this reduces to a form similar to pseudo-likelihood parameter estimation, again using covariances calculated with respect to the $p^{(k)}(X^{(k)})$ measure. In this case, a closed form solution can be obtained for the parameters, but if the resulting parameters do not satisfy the positivity conditions [3], a simple gradient descent method can be used.

Both methods presented above require covariance values $E_{p^{(k)}}(X_a^{(k)} X_{a+r}^{(k)})$, which can be computed given the GMRF parameters for $X^{(0)}$ as shown below:


r(k)  _ 


=  A, 


(0) 


J_  ^  (Ag,Ugg)(A^;A^^) 
M2  1  -  2[ 


860(0) 


where  A,  =  exp{V—l^)- 
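
As a rough illustration of how coarse-level covariances can be computed from fine-level parameters, the sketch below evaluates the covariance function of a GMRF through the eigenvalues of its interaction matrix. It rests on assumptions that the abstract does not state: a toroidal (periodic) $M \times M$ lattice, a symmetric neighbor set, and subsampling by a factor of two per pyramid level. The function names `gmrf_covariance` and `coarse_covariance` are hypothetical.

```python
import math

def gmrf_covariance(theta, sigma2, M):
    """Return cov[t1][t2] = E[X_a X_{a+t}] for a GMRF on an M x M torus.

    theta maps every neighbor offset (r1, r2) to its coefficient and must be
    symmetric: theta[(r1, r2)] == theta[(-r1, -r2)].  (Assumed setup.)
    """
    def eigenvalue(i, j):
        # Eigenvalue of the block-circulant interaction matrix at frequency (i, j).
        return 1.0 - sum(c * math.cos(2 * math.pi * (i * r1 + j * r2) / M)
                         for (r1, r2), c in theta.items())

    mu = [[eigenvalue(i, j) for j in range(M)] for i in range(M)]
    if min(min(row) for row in mu) <= 0:
        raise ValueError("parameters violate the positivity condition")

    cov = [[0.0] * M for _ in range(M)]
    for t1 in range(M):
        for t2 in range(M):
            total = sum(math.cos(2 * math.pi * (i * t1 + j * t2) / M) / mu[i][j]
                        for i in range(M) for j in range(M))
            cov[t1][t2] = sigma2 * total / (M * M)
    return cov

def coarse_covariance(cov, k):
    """Covariances of the field subsampled k times (factor 2 per level, assumed)."""
    step = 2 ** k
    return [row[::step] for row in cov[::step]]
```

The direct quadruple loop is only meant for small lattices; a fast Fourier transform would give the same values far more cheaply.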

We have used these models for multiresolution texture segmentation and have found that the multiresolution algorithm performs better than monoresolution algorithms, with lower computational requirements. In general, these multiresolution models can be applied to other low-level image processing applications that use GMRF models.
Abstract - Detecting minefields in the presence of clutter is an important challenge for the Navy. Minefields have point patterns that tend to be regular for a variety of reasons, including strategic doctrine, safety, tactical efficiency, and, perhaps most intriguing, the human element. For example, humans have a tendency to make lottery number selections, a one-dimensional discrete point process, in a non-uniform manner. In this paper, we introduce several simple procedures to detect regularity in point processes.

I. Introduction

The  success  of  vital  Navy  amphibious  assault  operations  de¬ 
pends  on  detecting  minefields  for  subsequent  neutralization  or 
circumvention.  Reconnaissance  data  from  the  surf  zone  can  be 
modeled  as  a  point  process  indicating  locations  where  mines  or 
more  precisely  minelike  objects  have  been  detected  by  a  sen¬ 
sor.  The  presence  of  a  minefield  produces  point  patterns  that 
tend  to  be  regular  (i.e.,  equally  spaced).  This  property  is  a 
potentially  valuable  discriminant  against  natural  clutter  (such 
as  rocks)  that  exhibit  complete  spatial  randomness  (CSR)  in¬ 
dicative  of  a  homogeneous  Poisson  process  model  (see  [3]). 

We first look at a simple and intuitive example. A lottery number selection consists of $n$ different integers $x_1, x_2, \ldots, x_n$ between 1 and $N$ inclusive. Proper characterization of human tendencies can dictate a strategy for selecting numbers that reduces the probability of multiple winners and thereby effectively increases the expected payoff. For example, it was shown in [2] that certain individual lottery numbers tend to be selected significantly more often than others. Here, we focus on the interdependency among the entire sequence of $n$ selected numbers.

II.  Minimum  and  Maximum  Gaps 

Consider the distances or "gaps" between adjacent points $\{d_j = x_{j+1} - x_j : 1 \le j < n\}$. We hypothesize that humans tend to avoid extreme gaps because they seem "nonrandom". This translates into a disproportionately high frequency of selections without a low minimum gap and/or a high maximum gap. Moreover, we expect the gap range, the difference between the maximum and the minimum, to be small.

This approach was motivated by a simple but rather surprising recent observation [1] that randomly selected lottery numbers often include consecutive numbers. It is easy to show that for the minimum gap $U$,

$$\Pr\{U > u\} = \binom{N - u(n-1)}{n} \Big/ \binom{N}{n}. \qquad (1)$$

For example, the probability of no consecutive lottery numbers in Virginia, where $N = 44$ and $n = 6$, is $\Pr\{U > 1\} = .462$. A straightforward application of the inclusion-exclusion principle [5] to the maximum gap $V$ gives

$$\Pr\{V \le v\} = \sum_{j \ge 0} (-1)^j \binom{n-1}{j} \binom{N - jv}{n} \Big/ \binom{N}{n}, \qquad (2)$$

where the summation continues over positive entries. The null distribution for the gap range $W = V - U$ can also be found. For example, for the Virginia lottery $\Pr\{W < 4\} = .03$ and $E[W] = 12$. Our conjecture suggests that the expected gap range is significantly smaller for human selections.
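
The gap probabilities quoted above can be checked numerically. The sketch below counts uniformly random selections with the standard shifting argument (minimum gap) and inclusion-exclusion (maximum gap); the function names are illustrative and are not taken from the paper.

```python
from math import comb

def prob_min_gap_exceeds(N, n, u):
    """Pr{U > u}: every gap between adjacent selected numbers exceeds u."""
    top = N - u * (n - 1)
    # Shifting the j-th smallest number down by (j - 1) * u gives a bijection
    # with ordinary n-subsets of {1, ..., top}.
    return comb(top, n) / comb(N, n) if top >= 0 else 0.0

def prob_max_gap_at_most(N, n, v):
    """Pr{V <= v}: no gap between adjacent selected numbers exceeds v."""
    total = 0
    for j in range(n):
        top = N - j * v
        if top < n:          # remaining terms are zero
            break
        total += (-1) ** j * comb(n - 1, j) * comb(top, n)
    return total / comb(N, n)

if __name__ == "__main__":
    # Virginia example from the text: N = 44, n = 6.
    print(round(prob_min_gap_exceeds(44, 6, 1), 3))
```

For the Virginia parameters the first function reproduces the value .462 given in the text.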

III.  Minefield  Detection  Tests 

Consider a point process of size $n$ on a set $A$ in $\mathbb{R}^2$ that has been partitioned into $N$ regions of equal area. Let $M_r$ denote the number of regions containing exactly $r$ points and $Y_k$ denote the number of points in region $k$. If this process is generated by humans (i.e., minefields), the lottery analogy leads one to suspect fewer empty regions (smaller gaps) than under a CSR model.

The empty boxes test (EBT) based on $M_0$ has traditionally been used to detect the presence of too many empty boxes as an indication of lack of fit (see [4]). In these terms, humans tend to overfit. Dividing $A$ into an increasing number of regions and plotting the normalized EBT statistic at each scale produces a curve similar to the K-function (see [3]). However, the EBT approach can be more flexible and is free of edge effects and independence assumptions.

IV.  Too  Likely  Likelihood  Tests 

The joint distribution of $Y_1, Y_2, \ldots, Y_N$ is multinomial under CSR. In particular,

$$\log f(y_1, y_2, \ldots, y_N) = \log n! - n \log N - \sum_{r=2}^{n} M_r \log r! \qquad (3)$$

so that even distributions of the points among the regions are more likely than uneven distributions. This seemingly creates a paradox with EBT. For example, the most likely value of (3) under CSR when $n = N$ corresponds to $M_0 = 0$ (one point in each region), for which EBT would reject CSR with the lowest possible p-value.
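
To make the role of the occupancy counts $M_r$ concrete, here is a minimal sketch of the multinomial log-likelihood under CSR with $N$ equal-area regions. The function name `csr_log_likelihood` is hypothetical; the formula is the equal-probability multinomial log-likelihood written in terms of $M_r$.

```python
from collections import Counter
from math import lgamma, log

def csr_log_likelihood(counts):
    """log f(y_1, ..., y_N) for region counts under equal-probability multinomial sampling."""
    N = len(counts)              # number of regions
    n = sum(counts)              # number of points
    occupancy = Counter(counts)  # occupancy[r] = M_r, regions holding exactly r points
    penalty = sum(m * lgamma(r + 1) for r, m in occupancy.items() if r >= 2)
    return lgamma(n + 1) - n * log(N) - penalty

if __name__ == "__main__":
    even = csr_log_likelihood([1, 1, 1, 1])    # one point per region, M_0 = 0
    uneven = csr_log_likelihood([4, 0, 0, 0])  # all points in one region
    print(even > uneven)
```

Only regions with two or more points contribute to the penalty term, which is why the perfectly even configuration attains the largest value.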

In this context, EBT is an example of a "too likely" likelihood test (TLLT). Without specifying an alternative, a TLLT rejects $H_0$ for high values of $\log f(T)$, where $f$ is the null distribution for a statistic $T$. For the minefield detection scenario and large values of $N$, the mean and variance of the TLLT test statistic can be estimated using a Poisson approximation.
Abstract - New distortion measures are derived
from  a  recently  proposed  characterization  function 
of  stationary  time  series  and  are  shown  to  be  more 
robust  than  some  commonly-used  distortion  mea¬ 
sures  such  as  the  Kullback-Leibler  spectral  diver¬ 
gence  in  speech  processing. 

I.  Introduction 

Distortion  measures  are  widely  used  in  speech  process¬ 
ing  to  quantify  the  deviations  of  speech  signals  in  cor¬ 
relation  structure,  and  among  the  most  successful  ones 
is  the  Itakura-Saito  (IS)  distance  of  spectral  densities 
[2],  also  known  as  the  Kullback-Leibler  information  di¬ 
vergence  [3].  Although  in  many  cases  the  IS  distance 
is quite effective in discriminating signals and detecting
special changes, its lack of robustness is also well
documented in the literature, especially when
the  signals  are  mixtures  of  narrow  and  wide  band  com¬ 
ponents  such  as  voiced  speech  waveforms  (e.g.,  [1]).  On 
the  basis  of  a  method  called  parametric  filtering,  we 
propose  some  new  distortion  measures  that  are  shown 
to  be  more  robust  than  the  IS  distance. 

II.  New  Distortion  Measures 

Given a zero-mean stationary signal $\{X_t\}$, the parametric filtering method characterizes the correlation structure of $\{X_t\}$ by the demodulated first-order autocorrelation of the form

$$\gamma_\theta(\eta) := \Re\{e^{-i\theta} \rho(\alpha)\} \qquad (-1 < \eta < 1),$$

where $\rho(\alpha)$ is the first-order autocorrelation of the filtered signal $X_t(\alpha) := \alpha X_{t-1}(\alpha) + X_t$ with $\alpha := \eta e^{i\theta}$. Among other interesting properties of $\gamma_\theta(\eta)$, it can be shown [4], [5] that $\gamma_\theta(\eta)$ uniquely determines the correlation structure of $\{X_t\}$ for almost any $\theta$ and is infinitely differentiable in $\eta \in (-1, 1)$ even for mixed-spectrum signals of which the spectral density does not exist. Using these properties, we define

$$p_\theta(\eta) := \tfrac{1}{2}\left[\gamma_\theta'(\eta) + (\gamma_\theta(\eta_a) + 1)\,\delta(\eta - \eta_a) + (1 - \gamma_\theta(\eta_b))\,\delta(\eta - \eta_b)\right],$$

for any $-1 < \eta_a < \eta_b < 1$, where $\delta(\eta)$ is the Dirac delta. Clearly, the function $p_\theta(\eta)$ forms a (generalized) probability density in $[\eta_a, \eta_b]$ and, because of its equivalence to $\gamma_\theta(\eta)$, uniquely determines the correlation structure of $\{X_t\}$ for almost any $\theta$. Therefore, we can define the following distortion measures using the Kullback-Leibler information divergence [3], namely

«(P9lb9)  :=  /  Peiv)K{pe{v)/Peiri))dr], 

rVb 

ii(Pg\pl)  ■=  /  K{p°gir])/plir]))dT], 

Jna 

where $K(u) := u - \log u - 1$. Many other distortion measures can be defined, for instance, from the family of Rényi's information [6].

The IS spectral distance is known to be extremely sensitive to deviations of individual spectral peaks while less so to changes of overall spectral shapes (or envelopes). The two new measures $\kappa$ defined above are potentially more robust than the IS distance because they are finite even when the spectral support changes. With this property, the new measures are able to avoid the disproportionate sensitivity to frequency shifts and spectral peaks, and thus to discriminate correlation structures by treating the discrete and continuous components on an equal basis.

References 

[1] R. André-Obrecht, “A new statistical approach for the automatic segmentation of continuous speech signals,” IEEE Trans. ASSP, vol. 36, pp. 29-40, 1988.

[2] F. Itakura and S. Saito, “A statistical method for estimation of speech spectral density and formant frequencies,” Electron. Commun. Japan, vol. 53-A, pp. 36-43, 1970.

[3]  S.  Kullback,  Information  Theory  and  Statistics,  New 
York:  Dover,  1968. 

[4]  T.  H.  Li,  “Discrimination  of  time  series  by  parametric 
filtering,”  Tech.  Rep.  212,  Dept,  of  Statistics,  Texas 
A&M  Univ.,  College  Station,  1994. 

[5]  T.  H.  Li  and  J.  D.  Gibson,  “Discriminant  analysis  of 
speech  by  parametric  filtering,”  Proc.  28th  Conf.  In¬ 
form.  Sci.  Syst,  1994. 

[6] E. Parzen, “Time series, statistics, and information,” in New Directions in Time Series Analysis, Pt. I, D. Brillinger et al., Eds., New York: Springer, pp. 265-286, 1992.




Nonparametric  kernel  estimation  for  error  density 
Zhu  Yu  Li  and  Shu  Zhao  Zou 
Dept. of Math., Sichuan University, Chengdu, China, 610064


Consider a linear model

$$y_i = x_i'\beta + e_i, \quad i = 1, 2, \ldots, \qquad (1)$$

where the $x_i$ are known $p$-dimensional ($p \ge 1$) vectors, $\beta \in R^p$ is an unknown parameter vector, and the $e_i$ are assumed to be i.i.d. random variables from a common unknown density function $f(x)$ with

$$\mathrm{med}(e_i) = 0. \qquad (2)$$

Based on the LAD (Least Absolute Deviations) estimator $\hat\beta$ of $\beta$, we propose a nonparametric method to estimate the unknown $f(x)$. A kernel estimator $f_n(x)$ is obtained as

$$f_n(x) = \frac{1}{n h_n} \sum_{i=1}^{n} k\!\left(\frac{x - \hat e_i}{h_n}\right), \quad x \in R^1, \qquad (3)$$

where $\hat e_i = y_i - x_i'\hat\beta$ are the residuals, $h_n$ is a positive number called the window width, and $k(\cdot)$ is a Borel measurable function on $R^1$. Large sample properties of $f_n(x)$ are studied. Some computational examples are also given.
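
A minimal sketch of a kernel density estimate built from LAD residuals is given below. To stay self-contained it assumes the simplest special case, an intercept-only model, for which the LAD estimate is the sample median; the Gaussian kernel, the window width, and the function names are illustrative choices, not the authors'.

```python
import math
from statistics import median

def gaussian_kernel(u):
    """Standard normal density, used here as an example kernel k(.)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(residuals, h, x, kernel=gaussian_kernel):
    """Kernel estimate (1 / (n h)) * sum_i k((x - e_i) / h) evaluated at x."""
    n = len(residuals)
    return sum(kernel((x - e) / h) for e in residuals) / (n * h)

def lad_residuals_intercept_only(y):
    """Residuals y_i - b, where b = median(y) minimizes sum_i |y_i - b|."""
    b = median(y)
    return [yi - b for yi in y]

if __name__ == "__main__":
    y = [1.0, 2.0, 2.5, 4.0, 10.0]
    residuals = lad_residuals_intercept_only(y)
    print(kernel_density(residuals, h=1.0, x=0.0))
```

With regressors present, the residuals would instead come from a full LAD fit, which is usually computed by linear programming.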
Abstract - A comparative analysis of three neural network models, Backpropagation (BPP), Bidirectional Associative Memory (BAM) and Holographic Associative Memory (HAM), and a classical method for error correction is presented. Each method is briefly described, results are reported, and finally some advantages are identified.

I.  Introduction 

Error correction is an important topic in any digital communication system. Classical methods for error correction are usually based on Hamming distance techniques. New technologies such as neural networks are presented as an option for those who need to solve the error-correction problem in an alternative way.

II.  Description 

For the classical methods, a linear block code is used. In this method, the capability for error correction is a function of the Hamming distance, in the sense that there exists a theoretical limit on error correction that depends only on the minimum Hamming distance between the codewords. The BPP method is designed as a multilayer feedforward net based on a supervised learning model. The net is trained with a predefined set of all the cases involved in the correction; in other words, for an (n,k) code, all its single-error combinations must be trained [1]. All these input-output pairs travel along the layers to the output layer and are then compared with the desired output value, constructing an error signal for each output unit. At this point, the error signals are backpropagated along the net. This process is repeated until a steady state is reached. Once the net is trained, new patterns are introduced and a response is obtained.

The BAM consists of two layers of processing elements that are completely interconnected. In the BAM's architecture there are weights associated with the connections between processing elements, forming a matrix. This matrix is used to recall the information when new data are tested. BAM is capable of reconstructing noisy data. The bidirectional nature of the BAM appears during the recall process [2]. Once the net is trained, testing data are introduced to the BAM. These data are propagated along the two layers and an output is generated. The output is propagated backwards and the outcome is compared with the previous input. If no error exists between them, the recall obtained is the last output generated; otherwise the process is repeated until a steady state is reached. The convergence of the recall is guaranteed by a Lyapunov function involved in the stability of the system. The theoretical limit for error correction was reached with the BAM.

The HAM bases its operation on the principle of optical holography of "enfolding" information of different phase in a single plane [3]. This analogy is clearly shown in the ability of the HAM to superimpose multiple stimulus-response associations onto the identically same correlation set representative of synaptic connections within the neuron cell in a complex number domain. The external field of accepted words in the alphabet is transformed to a complex plane by means of a sigmoidal function. These encoded data are set in a matrix representation for training the net. A new stimulus set is presented to the HAM for testing. The HAM calculates the minimum difference between the trained data and the tested data. A response is generated with the contribution of the difference between vectors and the closest desired output.

III.  Results 

Different results were obtained for the three neural networks used. The BPP net decoded 90 per cent of the cases when tested with the predefined input patterns (128 patterns [1,4]).

For a specific (7,4) code with minimum Hamming distance equal to three, one error was corrected with a BAM trained with only the 16 words allowed in the alphabet.

HAM's results show that, as in the BAM's case, the theoretical limit of one corrected error was reached with a minimum of 16 trained alphabet words for the (7,4) code. It is important to note that the HAM also corrected more errors than the theoretical limit in 60 per cent of the cases.
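
For reference, the classical baseline can be sketched as follows: a (7,4) code with 16 codewords and minimum Hamming distance three, decoded by choosing the nearest codeword, corrects every single-bit error. The generator matrix below is one common systematic choice and is an assumption, since the abstract does not say which (7,4) code was used. Note that 16 codewords times 8 patterns each (one clean, seven single-error) gives 128 patterns, which may correspond to the 128 test patterns mentioned above, although the abstract does not say so.

```python
from itertools import product

# Assumed generator matrix of a systematic (7,4) Hamming code.
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def encode(message):
    """Multiply a 4-bit message by G over GF(2)."""
    return tuple(sum(m * g for m, g in zip(message, column)) % 2
                 for column in zip(*G))

CODEWORDS = [encode(m) for m in product((0, 1), repeat=4)]

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def decode(received):
    """Minimum-distance decoding: return the closest codeword."""
    return min(CODEWORDS, key=lambda c: hamming_distance(c, received))

def flip(word, position):
    """Return a copy of word with the bit at the given position inverted."""
    return tuple(bit ^ 1 if i == position else bit for i, bit in enumerate(word))

if __name__ == "__main__":
    sent = encode((1, 0, 1, 1))
    print(decode(flip(sent, 2)) == sent)
```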

IV.  Conclusions 

This work shows that neural methods for error correction present an alternative to other error-correcting techniques, with the advantages of simple programming and good correction rates.
Abstract - This paper deals with a new type of
covert  channel  problem  that  arose  when  we  designed 
a  multilevel  secure  computer  (MLS)  system,  using 
a  quasi-secure,  asynchronous,  communication  device 
called  the  Pump.  We  call  this  new  type  of  covert  chan¬ 
nel  a  statistical  channel.  It  is  our  hope  to  get  feedback 
from  experts  who  work  in  the  intersection  of  informa¬ 
tion  theory  and  statistics. 

I.  Introduction 

In an MLS system, Low may write to High, and High can read from Low, but High must never be able to write to Low. However, in an MLS system, the need for an acknowledgement (ACK), which is a write from High to Low, to a message sent by Low to High can violate the multilevel security policy by creating a covert (communication) channel.

Consider  a  case  where  Low  sends  messages  to  High.  A 
simple  approach  that  does  not  allow  High  to  send  an  ACK 
to  Low  places  a  buffer  between  Low  and  High.  Low  submits 
messages  to  the  buffer,  the  buffer  sends  the  ACKs  back  to 
Low,  and  High  then  takes  messages  from  the  buffer.  If  the 
Low  (sending)  rate  is  faster  than  the  High  (receiving)  rate, 
Low  will  write  over  unread  data  in  the  buffer  (since  the  buffer 
is  finite).  An  obvious  solution  to  this  problem  is  to  not  allow 
Low  to  send  messages  until  there  is  a  space  in  the  buffer.  This, 
however,  results  in  a  large  capacity  covert  channel  between 
High  and  Low  (if  Low  is  not  allowed  to  send  messages  to  a 
full  buffer,  then  High  can  send  symbols  to  Low  by  removing 
or  not  removing  messages  from  the  buffer  and  hence  causing 
the  buffer  to  be  full  or  to  have  space  on  it). 

II.  The  Pump 

Our approach, the Pump [1], still places a buffer (size $n$) between Low and High, but has the buffer give ACKs at probabilistic times to Low based upon a moving average of the past $m$ High response times ($\bar{H}_{m_i}$). A High response time is the time from when the buffer tells High that it has a message to the time when High actually removes it. This has the double benefit of keeping the buffer from filling up and having a minimal negative impact upon performance.

Using  a  moving  average  is  a  very  important  part  of  the 
Pump.  However,  it  gives  rise  to  a  new  type  of  timing  channel 
(for  detail,  see  [1]).  We  will  now  sketch  an  implementation 
of  the  Pump.  Let  Ov  be  the  communication  overhead  for  the 
Pump.  By  this  we  mean  that  Ov  is  the  minimum  value  for 
any  Li  (which  is  the  ith  response  to  Low).  The  Li  are  given 
by  a  random  variable  that  has  the  density  function  fi(t). 




There are two cases to discuss:


Case  1:  The  buffer  is  not  full. 


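$$f_i(t) = \begin{cases} \alpha_i e^{-\alpha_i (t - Ov)}, & \text{if } Ov \le t, \\ 0, & \text{otherwise.} \end{cases}$$

The mean of the above density function is $Ov + 1/\alpha_i$. Since we wish for this mean of $f_i(t)$ to be equal to the moving average of the last $m$ High ACK times ($\bar{H}_{m_i}$), we see that $\alpha_i = 1/(\bar{H}_{m_i} - Ov)$. If $\bar{H}_{m_i} = Ov$, then set $1/\alpha_i = \epsilon$, a small number.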
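
The Case 1 rule can be sketched as a sampler for the ACK delay: a shifted exponential whose mean equals the moving average of High response times. The function names, the default value of `epsilon`, and the choice to also apply the epsilon rule when the moving average falls below the overhead are assumptions made for illustration.

```python
import random

def ack_rate(moving_average, overhead, epsilon=1e-6):
    """Rate alpha_i chosen so that overhead + 1/alpha_i equals the moving average."""
    mean_extra_delay = moving_average - overhead
    if mean_extra_delay <= 0:
        # Degenerate case: use a small positive mean instead (assumed handling).
        mean_extra_delay = epsilon
    return 1.0 / mean_extra_delay

def sample_ack_delay(moving_average, overhead, rng=random, epsilon=1e-6):
    """Draw one ACK delay L_i from the shifted exponential density f_i(t)."""
    return overhead + rng.expovariate(ack_rate(moving_average, overhead, epsilon))

if __name__ == "__main__":
    rng = random.Random(0)
    delays = [sample_ack_delay(3.0, 1.0, rng) for _ in range(10000)]
    print(sum(delays) / len(delays))  # should be close to the moving average 3.0
```

Because the delay seen by Low depends on High's recent behaviour only through the moving average, this parameter is exactly what Low would try to estimate.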


Case  2:  The  buffer  is  full. 

This  case  is  not  germane  to  this  paper. 


III.  Covert  Channels 

A timing (covert) channel exists when the output (Low) alphabet consists of the different times of the same response, these different times (e.g., "yes" arriving at 3t or 5t) being due to High behaviour. Historically, work on timing channels has used very simple tools from information theory, for example [2]. In the course of our work we have come upon a new type of timing channel that defies analysis by our research community. It is our hope that, by presenting a paper at this workshop, we will get feedback from experts who work in the intersection of information theory and statistics.

We  introduce  a  new  subspecies  of  timing  channel  referred 
to  as  a  statistical  channel.  The  Low  alphabet  consists  of  dif¬ 
ferent  time  values  and  these  time  values  are  given  by  a  random 
variable  with  certain  parameters  and  these  parameters  are  de¬ 
pendent  upon  High  actions. 

Definition  1  If  High  can  affect  a  parameter  in  the  distribu¬ 
tion  of  some  system  response  time  to  Low,  we  say  that  there 
is  a  statistical  channel  between  High  and  Low. 

In  the  Pump,  High  can  modify  the  moving  average  by  af¬ 
fecting  the  last  m  time  values  of  High’s  responses  to  the  Pump. 
It  is  possible  for  Low  to  detect  differences  in  High’s  actions 
by  trying  to  guess  what  the  moving  average  is.  This  creates 
a  statistical  channel  and,  therefore,  insecurity.  For  now,  let 
us forget that the exponential density has been shifted by the
communication  overhead  time,  and  simply  view  the  inputs  to 
the  channel  as  the  High  response  times.  We  state  a  simpler 
form  of  our  problem  as: 

What  is  the  capacity,  in  bits  per  unit  time,  of 
a  communication  channel  where  the  output  is  an 
exponential  random  variable  whose  mean  is  the 
moving  average  of  the  past  m  input  times? 
Abstract - Asymptotic (high rate) quantization theory is applied to the multivariate mismatch problem: how much is lost if a vector quantizer matched to a specific source with given parameters is used to quantize a source with different parameters? Sub-Gaussian processes are employed to parameterize the sources.

I.  Introduction 

Vector quantization (VQ) is a frequently used method of lossy source coding. Usually, a vector quantizer is optimized on a training sequence which is expected to represent all the statistical characteristics of the samples to be quantized. In most practical cases, however, it is impossible to find such a training sequence, because real source signals show time-varying statistics. Therefore, vector quantization is often implicitly coupled with the mismatch problem.

Performance evaluations of mismatched vector quantizers are interesting not only from an information-theoretic point of view. They also give clues for designing a robust quantizer when the actual source statistics are known to vary. If mismatch is eventually unavoidable, a vector quantizer should be designed such that mismatch around the operating point only weakly affects its optimum performance.

Only a few results concerning mismatched vector quantizers have been reported, the main reason being the lack of an appropriate comprehensive multivariate model. The situation changed when the engineering community became aware of the class of spherically invariant random processes (SIRP) and developed parametric source models [1][2]. SIRPs have the property that they are completely described by the univariate (marginal) density function and the linear statistical dependencies (covariances or covariations) between the random variables.

More important from a practical point of view, however, is that SIRP models reflect the statistics of a wide variety of sources. Band-limited telephone speech samples [1], mean-removed image blocks [2], subband image statistics [3] and prediction error images [4] show elliptically shaped bivariate distributions, thus allowing SIRP modeling.

II. Mismatched Vector Quantization of Sub-Gaussian Sources

In [5] a new SIRP model has been developed which employs symmetric stable densities [6] as marginal densities. Symmetric stable densities are defined as densities having a characteristic function of the form:

$$\phi(t) = \exp(-\gamma |t|^{\alpha}), \quad \text{with } \gamma > 0,\ 0 < \alpha \le 2. \qquad (1)$$

The stable distribution has much thicker tails than, e.g., the Gaussian, which makes it possible to model real-world phenomena that include outliers accurately with the aforementioned class of SIRPs. Moreover, it has been shown in [5] that the class of SIRPs with symmetric stable densities is identical to a class of processes termed sub-Gaussian processes in mathematical statistics. Sub-Gaussian processes are completely parameterized by a shape parameter (called the characteristic exponent) and a covariation matrix. With the characteristic exponent the shape of the distribution can be varied. The covariation matrix plays a role analogous to that of the covariance matrix in classical second-order process theory. With the covariation matrix the variation (i.e., the stable analogue of the variance) as well as the linear dependencies between the samples can be adjusted.

Since the symmetric $\alpha$-stable distribution (1) is completely specified by only two parameters (stable exponent $\alpha$ and variation $\gamma$), the mismatch problem can be formulated and solved in terms of shape and variation mismatch. Applying sub-Gaussian sources to asymptotic (high-rate, low-distortion) quantization theory [7], we evaluate the relative performance of mismatched vector quantizers for these mismatch conditions. The question of what happens if the actual source distribution differs in shape from the distribution the quantizer is optimized for can thus be answered by employing sub-Gaussian processes as the source model. It turns out that the robustness of a vector quantizer depends strongly on the dimension of the quantizer. Furthermore, vector quantizers respond, like scalar ones, unequally to mismatch around
their operation point. However, the sensitivity to mismatch decreases with increasing vector dimension.

ABSTRACT - Bayes decision theory is based on the assumption that the decision problem is posed in probabilistic terms, and that all of the relevant probability values are known [1]. The aim of this paper is to show how blind sliding window AR modeling is corrupted by an abrupt model change and to carry out a statistical study of the resulting AR parameters.

I  INTRODUCTION 

The aim of this paper is to study the behaviour of continuously
evolving  classification  when  applied  to  signals  presenting  an 
abrupt  change.  AutoRegressive  (AR)  parameters  are  widely 
used  to  constitute  the  observation  vector.  The  first  part  shows 
how  sliding  window  AR  modeling  is  corrupted  when  applied 
to  signals  that  change  abruptly.  The  second  part  studies  the 
evolution  of  AR  parameter  statistics  used  in  random  process 
classification. 

II SIGNALS PRESENTING ABRUPT CHANGE
Let us consider a particular signal $y(n)$ defined by the succession of two stationary signals $y_1(n)$ and $y_2(n)$ with an abrupt change occurring at $n = N_r$. A blind sliding window AR modeling of $y(n)$ (that is to say, without any previously detected change) leads to solving a set of Yule-Walker equations [2]. The theoretical time-dependent AR parameter vectors are given by:

$$\mathbf{a}(n) = \begin{cases} \mathbf{a}_1 & n \le N_r \quad (\text{parameter vector of } y_1(n)) \\ [\mathbf{a}_{n-N_r-1}^{T}, 0, \ldots, 0]^{T} & N_r + 1 \le n \le N_r + L \\ \mathbf{a}_2 & N_r + L + 1 \le n \quad (\text{parameter vector of } y_2(n)) \end{cases}$$

$L$ being the AR model order and $\mathbf{a}_{n-N_r-1}$ a vector composed of the $n - N_r - 1$ Levinson-Durbin Recursion (LDR) coefficients of the second signal $y_2(n)$. When $N_r \le n < N_r + L - 1$, AR estimation then leads to an AR vector with only $n - N_r - 1$ nonzero coefficients [4].
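The Levinson-Durbin recursion referred to above can be sketched in Python as follows. This is a minimal textbook version, not the authors' code; the function name `levinson_durbin` and its return layout are assumptions made for illustration. It returns the predictor coefficients of every intermediate order, which are the lower-order vectors that appear (padded with zeros) during the transition after the change.

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations recursively.

    r      -- autocorrelation values r[0], ..., r[order]
    order  -- AR model order L
    Returns (coeffs, errors): coeffs[m-1] holds the m predictor coefficients
    of the order-m model, errors[m] its prediction error power.
    """
    if len(r) < order + 1:
        raise ValueError("need autocorrelation lags 0..order")
    a = []                      # predictor coefficients of the current order
    err = r[0]                  # order-0 prediction error power
    coeffs, errors = [], [err]
    for m in range(1, order + 1):
        # Reflection coefficient of order m.
        acc = r[m] - sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        k = acc / err
        # Order update of the first m-1 coefficients, then append k.
        a = [a[j] - k * a[m - 2 - j] for j in range(m - 1)] + [k]
        err *= (1.0 - k * k)
        coeffs.append(list(a))
        errors.append(err)
    return coeffs, errors

if __name__ == "__main__":
    # Autocorrelation of an AR(1) process with coefficient 0.5 (illustrative).
    coeffs, errors = levinson_durbin([1.0, 0.5, 0.25], 2)
    print(coeffs, errors)
```

For an order-L model, a vector with only k nonzero coefficients can be formed as `coeffs[k - 1] + [0.0] * (L - k)`.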

III AR PARAMETER STATISTICS
We then study the case of random AR parameters. The probability density function (p.d.f.) of the AR parameter vector allows us to analyse the evolution of the class shape when the sliding window moves. Let us denote:

LI*  “  [“(1. 1)’  •••  ’  “(t.*;)’  1)>  “(t.l)] 

$a_{(i,j)}$ (with $j = 1, \ldots, i$) being the $i$-th order linear predictor coefficient estimator of the 2nd model, $k$ varying from 1 to $L$ such that $n = N_r + 1 + k$.

For  n  >  Yr  +  7  +  1 ,  we  get  F*  =  02-  vector  is  then 

the  second  AR  model  parameter  vector,  the  p.d.f  of  which, 
denoted  by  /ifiy* . *^i),is  assumed  to  be  known. 


This last hypothesis is not restrictive because, in pattern recognition, the AR parameter statistics that characterise within-class scattering are usually assumed to be known (generally Gaussian).

The  next  point  of  this  study  is  to  determine  the  F  ,  p.d.f, 

denoted  by  . ui)  as  a  function  of  that  of 

varying  over  [!...£].  The  first  L-k  components  of  these 
two  vectors  are  equal.  Their  last  k  components  verify  the 
following  relations : 


These  relations  can  be  inverted  and  allow  us  to  determine  the 
jacobian  of  the  transformation  between  the  two  vectors  , 

and  K*  [3].  We  then  obtain : 
nodd : 

1 

. Vl)-(l-wJ)  '  /»(Vi . . . Ui+VtVt.i) 

seven ; 

. I'll' (1 +y»)(i -Uk)  ’  /kC^i . I'k.Vk-i  +  iykt'i . 

(l+t/k)Vk/2 . Vl+t'kl'k-i) 

With $L$ recursions, we may then determine the statistics of the different vectors for $n = N_r + 1$ to $n = N_r + L$. These statistics allow us to study the evolution of the class shapes when the sliding window moves.

CONCLUSION 

We  show  that  sliding  window  AR  modeling,  applied  to  two 
stationary  AR  signals  with  an  abrupt  change,  gives  parameters 
which  follow  the  Levinson  Durbin  Recursion.  We  give  a 
recursive  method  making  it  possible  to  find  the  probability 
density  function  of  these  parameters  when  the  sliding  window 
moves. Class shapes may then be described in a continuously evolving classification.

Abstract - The finite-sample risk of the $k$-nearest neighbor classifier that uses an $L_p$ distance function is examined. For a family of classification problems with smooth distributions in $\mathbb{R}^n$, the risk can be represented as an asymptotic expansion in inverse powers of the $n$-th root of the reference-sample size. The leading coefficients of this expansion suggest that the Euclidean or $L_2$ distance function minimizes the risk for sufficiently large reference samples.

I. The $k$-Nearest-Neighbor Classifier
Let the elements of $\mathbb{L} = \{1, 2\}$ denote two states of nature, or pattern classes, and let $P_1$ and $P_2 = 1 - P_1$ denote their corresponding stationary prior probabilities. Each pattern is represented by a feature vector $X$, drawn at random from $\mathbb{R}^n$. Specifically, patterns originating from class $\ell \in \mathbb{L}$ are generated by the stationary conditional distribution $F_\ell$.

Labeled feature vectors are generated by a two-step process. First, a class $Z \in \mathbb{L}$ is chosen at random so that $P[Z = \ell] = P_\ell$ for $\ell \in \mathbb{L}$; then a random feature vector is drawn according to $F_Z$. After $m$ independent repetitions of this process, we obtain the labeled reference sample,

$$\mathcal{X}_m = \{(X^1, Z^1), \ldots, (X^m, Z^m)\}.$$

Given an $L_p$ metric and an arbitrary point $x \in \mathbb{R}^n$, the indices of the labeled feature vectors in $\mathcal{X}_m$ can be permuted so that

$$\|x - X^1\|_p \le \|x - X^2\|_p \le \cdots \le \|x - X^m\|_p. \qquad (1)$$

Here $\|x\|_p = (|x_1|^p + \cdots + |x_n|^p)^{1/p}$, for $1 \le p < \infty$, and $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$, denote the $L_p$ norm. The $k$ nearest neighbors of $x$ then form the subset $\{(X^1, Z^1), \ldots, (X^k, Z^k)\}$; and the $k$-nearest-neighbor classifier assigns $x$ to class $Z'(x) = \mathrm{maj}(Z^1, \ldots, Z^k)$, viz., the most frequently appearing class label in the subset. (Ties, and degeneracies in (1), can be resolved by an arbitrary procedure.) Using this algorithm every point in $\mathbb{R}^n$ can be assigned to a class in $\mathbb{L}$.
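The classifier described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions: the names `lp_norm` and `knn_classify` are hypothetical, and ties are broken by sort and insertion order, which is just one of the arbitrary procedures the text allows.

```python
import math
from collections import Counter

def lp_norm(v, p):
    """L_p norm of a vector; p = math.inf gives the max norm."""
    if p == math.inf:
        return max(abs(c) for c in v)
    return sum(abs(c) ** p for c in v) ** (1.0 / p)

def knn_classify(x, sample, k, p=2):
    """Assign x to the majority class among its k nearest neighbors.

    sample -- list of (feature_vector, label) pairs (the reference sample)
    """
    # Rank the reference sample by L_p distance to x, as in ordering (1).
    ranked = sorted(
        sample,
        key=lambda pair: lp_norm([a - b for a, b in zip(x, pair[0])], p),
    )
    # Majority vote over the labels of the k nearest neighbors.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    reference = [((0.0, 0.0), 1), ((0.1, 0.0), 1), ((0.0, 0.2), 1),
                 ((1.0, 1.0), 2), ((0.9, 1.0), 2), ((1.0, 0.8), 2)]
    print(knn_classify((0.1, 0.1), reference, k=3, p=2))
```

Changing `p` changes only the ranking step, so the same function can be used to compare different $L_p$ metrics on a fixed reference sample.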

II. The Finite-Sample Risk
Given a positive integer $k$, an $L_p$ metric, and a finite random reference sample $\mathcal{X}_m$, a single test vector $(X, Z)$, drawn independently by the same random process, is assigned to class $Z' = Z'(X)$ by the $k$-nearest-neighbor classifier. We now consider the $m$-sample risk,

$$R_m = P[Z' = 1, Z = 2] + P[Z' = 2, Z = 1],$$

^This  work  was  supported  in  part  by  Rome  Laboratory,  Air 
Force  Material  Command,  USAF,  under  grant  number  F30602-94- 
1-0010. 


for  two-class  problems  that  satisfy  the  following  smoothness 
conditions: 

C1. For $\ell \in \{1, 2\}$, the class-conditional distributions $F_\ell$ are absolutely continuous over $\mathbb{R}^n$ and have corresponding densities $f_\ell$.

C2. The mixture density, $f = P_1 f_1 + P_2 f_2$, is bounded away from zero a.e. over its probability-one support $S \subset \mathbb{R}^n$.

C3. Each class-conditional density, $f_\ell$, possesses uniformly bounded partial derivatives up to order $N + 1$ almost everywhere on its probability-one support.

C4. One or the other of the class-conditional densities vanishes close to the boundary of $S$.


Theorem 1 Under Conditions C1 through C4, there exist constants $c_j$, for $j = 2, 3, \ldots, N$, such that

$$R_m = R_\infty + \sum_{j=2}^{N} c_j m^{-j/n} + O\left(m^{-(N+1)/n}\right),$$

where $R_\infty$ is the infinite-sample risk derived by Cover and Hart [1].


A proof of this theorem, including derivations of the leading coefficients, will be published separately. (An analogous proof for the nearest-neighbor classifier ($k = 1$) under the Euclidean metric ($p = 2$) appears in a recent paper [2].) For the coefficient $c_2$, we obtain


C2  =  L)„(p) 


r(fc  +  i  +  |) 

24  [r(i^)]^ 


X 


-J-vVi  +  ivVz  -  .|vV 

fl  J2  J 


where

$$\hat{P}_\ell = \hat{P}_\ell(x) = \frac{P_\ell f_\ell(x)}{f(x)}$$

denotes the posterior probability that a feature vector with value $x$ originates from class $\ell$. In the above,


r(|  +  i)r(j-n)'«"'"> 

r(=±i  +  i)r(J  +  .)’ 

has a global minimum at $p = 2$ for fixed $n > 1$. This suggests that, under the above assumptions, the Euclidean metric is the optimal $L_p$ distance function if $m$ is sufficiently large.

Abstract - We have elucidated the position of Woodbury's statistical fuzzy Grade-of-Membership (GoM) model in the unsupervised clustering domain. This implementation of the model is shown to operate not only on multivariate categorical data, but on permuted, or encoded, data as well.

I.  Introduction 

We  present  results  of  a  fuzzy  unsupervised  clustering 
paradigm,  applying  Woodbury’s  [1,  2]  statistical  fuzzy  Grade- 
of-Membership  (GoM)  model  to  the  problem  of  identifying 
natural  clusters  and  statistical  structure  in  data.  Extensive 
theoretical  development  and  empirical  evaluation  of  the  GoM 
clustering paradigm is presented by Talbot et al. [3].

II.  GoM  Clustering  Paradigm 

The GoM model simultaneously estimates profile probability densities and memberships for a fuzzy partition. Model parameters estimated from the data suggest a latent structure which may simplify coding, classification, and other analyses of high-dimensionality data. GoM clustering provides a more general framework for data analysis compared with conventional clustering paradigms in many cases. GoM model attributes that contribute to its generality in this context include its operation on categorical random variables, foundation in fuzzy set mathematics, and detachment from distance measures. A major model attribute, operation on categorical random variables, contributes to broad applicability by admitting categorical or even coded input data and allowing for nonlinear partitions. Because the data is represented by a finite alphabet, the model is also less sensitive to disparate scaling and outlying samples in many cases. A second model attribute, its fuzzy set basis, allows for characterization of more complex sources of heterogeneity in the data. A third attribute of the GoM model is its detachment from distance measures. By considering distance only indirectly through transitivity relationships, the model elucidates data structures primarily based upon characteristics of the estimated data distributions rather than upon distance computations between points in the space. This detachment from distance measures not only provides an unprecedented opportunity to evaluate the statistical composition of the data source but also offers new insights into structural mechanisms affecting coding performance.

III.  GoM  Clustering  Examples 

GoM clustering performance was compared with conventional vector quantization (VQ), fuzzy c-means (FCM) clustering, and deterministic annealing (DA) clustering to highlight differences between partitioning based upon distance measures and partitioning based upon statistical data structure. Continuous ordered data was quantized to produce categorical data.

Experimental outcomes for a two-dimensional unit step example demonstrate that the GoM model can provide an intuitively satisfying partition and structural determination as well as excellent background discrimination.

Figure  1  shows  GoM  clustering  results  for  quantized  and  en¬ 
coded  multivariate  Gaussian  data  derived  from  crisply  defined 
distributions.  The  encoding  clarifies  the  categorical  nature  of 
GoM  clustering  and  also  suggests  potential  applications  for 
analysis  of  unconventional  data  sets  which  may  be  generated 
as  the  output  of  an  encoder  or  classifier.  In  this  case,  GoM 
clustering  provides  an  ideal  partition  of  the  encoded  data  as 
well  as  density  estimates  for  sample  data  derived  from  each 
cluster. The log-likelihood value was experimentally shown to be a suitable clustering criterion, providing a strong correspondence to performance.

Fig. 1: GoM clustering of multivariate Gaussian source: (a) quantized data and (b) quantized and encoded data. Gray level represents membership in each of two clusters.


IV.  Conclusions 

The  GoM  clustering  paradigm  supplements  existing  meth¬ 
ods  to  broaden  the  application  base  and  provide  additional 
partitioning alternatives. The encouraging results suggest many applications in coding and classification, especially when employed in concert with conventional techniques.

Abstract - For a fractional Gaussian noise model, we derive asymptotics for minimax risks and show that wavelet estimates can achieve minimax over a wide range of spaces. This article also establishes a Wavelet-Vaguelette Decomposition (WVD) to decorrelate fractional Gaussian noise.

Introduction 

Suppose we observe a function $f$ from the regression model

$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where $\varepsilon_i$ are zero-mean stationary normal errors with long-range dependence.

Long-range dependence occurs in many applications. For example, it arises in data from geophysics and hydrology, economic time series, biological signals, image generation and interpolation, texture classification, noise in electronic devices, frequency variation in music, and burst errors on communication channels. Signal processes with long-range dependence have a much more persistent long-term correlation structure than the well-studied short-range processes such as ARMA processes and mixing processes. Traditionally, these processes with long-range dependence have been mathematically awkward to manipulate. This has made the solution of many of the classical signal processing problems involving these processes rather difficult.
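To make the contrast with short-range processes concrete, the following Python sketch compares the autocovariance of unit-variance fractional Gaussian noise, using the standard formula $\gamma(k) = \frac{1}{2}(|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H})$, with the autocorrelation of an AR(1) process. The Hurst parameter 0.8 and the AR coefficient 0.8 are illustrative assumptions, not values from the paper.

```python
def fgn_autocov(k, hurst):
    """Autocovariance at lag k of unit-variance fractional Gaussian noise."""
    k = abs(k)
    h2 = 2.0 * hurst
    return 0.5 * ((k + 1) ** h2 - 2.0 * k ** h2 + abs(k - 1) ** h2)

def ar1_autocorr(k, phi):
    """Autocorrelation at lag k of a stationary AR(1) process (short-range)."""
    return phi ** abs(k)

if __name__ == "__main__":
    # The fGn correlation decays like a power of the lag, the AR(1) one geometrically.
    for lag in (1, 10, 100, 1000):
        print(lag, fgn_autocov(lag, 0.8), ar1_autocorr(lag, 0.8))
```

With Hurst parameter 0.5 the formula reduces to white noise, which gives a simple sanity check on the implementation.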

Fractional Gaussian noise provides a useful model for phenomena exhibiting long-range dependence. We propose a fractional Gaussian noise model, which is an approximation of the nonparametric regression model (1), and then establish asymptotic results for minimax risks. Because of long-range dependence, the minimax risk and the minimax linear risk converge to zero at rates that differ from those for data with independence or short-range dependence. It is shown that a wavelet estimate with resolution level-dependent threshold can be “tuned” to achieve minimax over Besov bodies with $p < q$. Linear estimates cannot achieve even the minimax rates over Besov classes when $p < 2$.

This work was in part supported by NSF Grant DMS-94-04142.

The key to proving the asymptotic results is to decorrelate fractional Gaussian noise and fractional Brownian motion via WVD, utilizing the idea of simultaneous diagonalization through WVD described in Donoho (1992) and the fact that fractional Gaussian noise is linked to fractional differential operators, which are almost diagonal in a wavelet basis.

Decorrelation of fractional Gaussian noise and fractional Brownian motion via WVD is of interest in its own right. In fractal signal processing, it is very desirable to decorrelate fractional Gaussian noise and fractional Brownian motion (e.g., see Wornell and Oppenheim (1992)). Although wavelets reduce the dependence of fractional Gaussian noise, the wavelet coefficients of fractional Gaussian noise and fractional Brownian motion are correlated, and hence wavelets themselves do not decorrelate fractional Gaussian noise and fractional Brownian motion. Fractional Gaussian noise and fractional Brownian motion can be decorrelated by WVD.

Moreover, we employ two WVDs to solve the following linear inverse problems in the presence of indirect, noisy data with long-range dependence

$$y_i = (Kf)(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (2)$$

where $\varepsilon_i$ are zero-mean stationary normal errors with long-range dependence and $K$ is a linear transformation.

Abstract - We derive two types of block-wise FNM model for pixel images by incorporating local context. The self-learning is then formulated as an information match problem and solved by first estimating model parameters to initialize the ML solution and then conducting finer segmentation through MRF relaxation.


I.  Introduction 

The main difficulty of unsupervised medical image analysis is that the model parameters are unknown and the prior context is unobservable. Any noncontextual algorithm is likely to perform poorly since locally there may not be sufficient information to make a good decision. The spatial dependence among pixels is a fundamental concern, and a reasonable assumption is that neighboring pixels are likely to have similar gray levels and the same label. Of the two main approaches to this problem, the MRF model-based techniques are often heuristically determined and computationally prohibitive [1], while the conventional FNM models only reflect partial context information at either the global or the pixel scale. This paper presents a new self-learning strategy based on stochastic regularization. The original contributions are: 1) two types of block-wise FNM models are derived for pixel images by incorporating local context; 2) a unified information match criterion is applied to both model determination and pixel labeling.


II.  Multiscale  FNM  Modeling 

FNM modeling has proven to be a successful tool for medical image analysis, mainly owing to the validity of the independence approximation of pixels according to image statistics [2]. We extend this framework to include local context at multiple scales. Assume a medical image with pixels and $K$ regions. After dividing the image into disjoint blocks, the joint probability density function (pdf) can be well approximated by a block-wise conditional FNM model given by


$$p(\mathbf{x}) = \prod_{r} \sum_{k=1}^{K} l_{rk} \prod_{i=1}^{c} \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{(x_{ri} - \mu_k)^2}{2\sigma_k^2}\right) \qquad (1)$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of the $k$th region, $c$ is the block size, and $l_{rk}$ is the label associated with the $r$th block. By randomly reordering the neighboring pixels of the $i$th pixel, a new joint pdf of pixel images can be defined by introducing the local context into the standard FNM model in block form
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«(>■) = IIEiE 


t=l  k=l 
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where  i  denotes  the  neighboring  pixel  with  the  label  and 
the local context is naturally translated into non-parametric bindings by Bayesian prior probabilities. The problem addressed here is the combined estimation and detection of the regional $(\mu_k, \sigma_k^2)$ and structural $(c, K)$ parameters and the contextual $(l_{rk}, l_{ik})$ variables, given the observations $\mathbf{x}$.


III.  Information  Criteria  and  Algorithms 
The unsupervised estimation and detection can be characterized as an optimal information match problem. By minimizing a unified cost function, we solve this problem using stochastic regularization in two steps. Since the parameters and variables in (1) are non-random unknown constants, we introduce a new model-fitting procedure derived from the modified global relative entropy, namely, the minimum bias/variance criterion (MBVC)

$$\mathrm{MBVC}(K, c) = \sum_{u} Q_x(u) \log \frac{Q_x(u)}{p(u \mid \hat{\theta})} + 3K - 1 \qquad (3)$$

where $\hat{\theta}$ is the ML estimate of the parameter vector, $Q_x(u)$ is the image histogram, and $p(u \mid \hat{\theta})$ is the standard FNM. The balancing of decomposed model bias and variance yields

$$(K_0, c_0) = \mathrm{Arg}\{\min \mathrm{MBVC}(K, c)\} \qquad (4)$$

with a simple optimality appeal: a minimum bias and variance model maximizes the information match [3]. Since (2) treats pixel-based labels as discrete random variables, by minimizing the expected Bayes risk, the Bayesian detection will classify pixel $i$ into region $j$, if


(Xi  -  Hkf 
2(tI 


)}  (5) 


The block algorithms take advantage of the ML estimator being regional-structural separable and of the MRF relaxation with local context revision consistency.

A. Multiple Resolution Block-Wise CM (MRBCM):

1. Given $c = c_{\max} + 1$

2. $c = c - 1$

• $d_{KL}(r, j) = \log(\sigma_j / \sigma_r) + [\sigma_r^2 - \sigma_j^2 + (\mu_r - \mu_j)^2] / 2\sigma_j^2$

• $l_{rk} = 1$, if $k = \mathrm{Arg}\{\min_{1 \le j \le K} d_{KL}(r, j)\}$

• Continue until $(l^{(m+1)} - l^{(m)}) = 0$

3. Stop when minimum global bias is reached.
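The block labeling step of the MRBCM listing uses the Kullback-Leibler distance between two scalar Gaussians. A minimal Python sketch of that distance and of the minimum-distance assignment is given below; the function names `kl_gauss` and `assign_block`, and the example region parameters, are hypothetical and only illustrate the step, not the full multi-resolution algorithm.

```python
import math

def kl_gauss(mu_r, var_r, mu_j, var_j):
    """KL distance between N(mu_r, var_r) and N(mu_j, var_j).

    Equals log(sigma_j / sigma_r)
           + [var_r - var_j + (mu_r - mu_j)**2] / (2 * var_j).
    """
    return (0.5 * math.log(var_j / var_r)
            + (var_r - var_j + (mu_r - mu_j) ** 2) / (2.0 * var_j))

def assign_block(block_mean, block_var, regions):
    """Return the index of the region closest to the block in KL distance.

    regions -- list of (mean, variance) pairs, one per region
    """
    distances = [kl_gauss(block_mean, block_var, mu, var) for mu, var in regions]
    return min(range(len(regions)), key=distances.__getitem__)

if __name__ == "__main__":
    regions = [(0.0, 1.0), (10.0, 4.0), (20.0, 9.0)]   # illustrative values
    print(assign_block(9.0, 4.0, regions))
```

Repeating this assignment for every block and re-estimating the region parameters until the labels stop changing corresponds to the inner loop of step 2.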

B.  Local  Contextual  Bayes  Relaxation  Labeling  (LCBRL): 

1. Given $m = 0$

2. $m = m + 1$

•  Randomly  visit  each  i  and  calculate  Em  ^ 

•  Update  lik  according  to  (5) 

3.  Continue  until  —  g. 

Simulation results show efficient and robust performance.

Abstract - The formalism behind a novel additive congruential method is described. This method yields uniform random sequences whose outcomes occur more than once throughout the sequence.

I.  Introduction 

An approach to generating a uniformly distributed pseudorandom sequence (PS) is presented; the method is based on addition rather than multiplication, hence the name Additive Congruential Method (ACM). For a selected prime, $p$, the PS is a sequence of random variables (RV) over the Galois field $GF(p)$. The ACM yields a Markov PS, where each valid outcome appears more than once within the main period, and still obeys a uniform distribution (UD). It is argued that such a PS shows improved randomness over the standard Multiplicative Congruential Method (MCM) [1].

II.  The  Main  Result 

The PS is a chain of RVs over $GF(p)$, $p$ prime. Theorem 1 and Corollary 1.1, stated without proof for the purpose of this paper, guarantee a uniform distribution on outcomes.

Theorem 1: Let $p$ be a prime, $p > 2$, and let $k \in \mathbb{N}$, $1 < k < p - 2$. Perform all possible modulo-$p$ products between $k$ distinct elements from $GF(p) \setminus \{0\}$. The number of occurrences, among the $\binom{p-1}{k}$ results, will be the same for each element in $GF(p) \setminus \{0\}$ iff $k \nmid (p - 1)$. □

The proof uses the structure of $GF(p)$ to write a finite difference equation with the distribution on outcomes as solution; its unique solution is the UD. Multiplication translates into addition if we let $s = \alpha^t$, $t = 0, \ldots, p - 1$, $\forall s \in GF(p)$, $\alpha \in GF(p)$ primitive. The corollary follows.
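The counting experiment behind Theorem 1 is easy to reproduce by brute force for a small prime. The Python sketch below enumerates all products of $k$ distinct nonzero residues modulo $p$ and tallies the results; the function name `product_counts` is hypothetical, and the values $p = 11$ with $k = 3$ and $k = 2$ are chosen only as an illustration (the first pair is the one used in the example of Section III).

```python
from collections import Counter
from itertools import combinations
from math import prod

def product_counts(p, k):
    """Count how often each residue occurs among all modulo-p products
    of k distinct nonzero elements of GF(p)."""
    counts = Counter()
    for subset in combinations(range(1, p), k):
        counts[prod(subset) % p] += 1
    return counts

if __name__ == "__main__":
    # k = 3: every nonzero residue occurs equally often.
    print(sorted(product_counts(11, 3).items()))
    # k = 2: the counts are not all equal.
    print(sorted(product_counts(11, 2).items()))
```

For $p = 11$ and $k = 3$ there are $\binom{10}{3} = 120$ products spread evenly over the 10 nonzero residues, whereas for $k = 2$ the 45 products are not evenly spread.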

Corollary 1.1: Let $p$ be a prime, $p > 2$, and let $k \in \mathbb{N}$, $1 < k < p - 2$. Perform all possible modulo-$p$ additions of $k$ distinct integers from $GF(p) \setminus \{p - 1\}$. The number of occurrences among the $\binom{p-1}{k}$ results will be the same for each element in $GF(p) \setminus \{p - 1\}$ iff $k \nmid (p - 1)$.

The  ACM  relies  on  Corollary  1.1.  If  we  calculate  the  em¬ 
pirical  probability  transition  matrix  (PTM),  we  see  that  it  is 
doubly  stochastic.  Clearly,  this  is  due  to  the  UD  on  outcomes. 
Thus  we  may  view  the  PS  as  a  realization  of  some  time  invari¬ 
ant  Markov  process  (TIMP)  with  a  doubly  stochastic  PTM. 
A  TIMP  approaches  a  UD  iff  its  PTM  is  doubly  stochastic. 

Definition  1:  Model  the  ACM  generated  PS  as  a  realiza¬ 
tion  of  some  TIMP  with  doubly  stochastic  PTM.  A  natural 
measure  of  randomness  for  the  ACM  sequence  is  the  degree  of 
randomness  of  the  associated  Markov  process. 

In order to use Definition 1, we need to define a measure of randomness for the TIMP with doubly stochastic PTM. Such a measure, conjectured to be well-defined, is suggested by the following argument, which needs to be formalized. Contrary to the MCM, where once an outcome has occurred an observer can count on the fact that it will NOT occur again within the main period, in the ACM PS there are multiple occurrences. This is perceived as better randomness, yet the distribution on the states of the TIMP is still not uniform, although the ensemble distribution is uniform. Consider the situations when an arbitrary integer in $GF(p)$ can be followed by (1) some integers in $GF(p)$, but not all of them, and (2) any integer in $GF(p)$. Clearly, the latter PS is more random since we are less able to 'predict' the next outcome. It can be shown that if $P$ is a doubly stochastic matrix of order $m$, then $P^n$, $n \in \mathbb{N}$, are stochastic and all entries in $P_\infty = \lim_{n \to \infty} P^n$ are equal. But $P^n$ is the PTM of a Markov process described by $P$ after decimating by $n$; hence retaining every $n$th sample achieves a more uniform distribution of states at any time instant. This is not possible with an MCM PS because of unique occurrences. The eigenvalues of $P^n$ are the $n$th powers of the eigenvalues of $P$ and approach the eigenvalues of $P_\infty$, which are zero except for one that equals one. As stochastic matrices, $P^n$, $n \ge 1$, and $P_\infty$ each have a unity eigenvalue; thus the magnitudes of the non-unity eigenvalues of $P$ are less than one (in order for the $n$th powers of the remaining $m - 1$ eigenvalues to approach zero as in $P_\infty$). Since $p$ is prime, $m = p - 1$ is even; hence there is at least one more real eigenvalue $\lambda_{0n} \ne 0$, $|\lambda_{0n}| < 1$, for each of $P^n$, $n \ge 1$, and the tendency to improve randomness as $n \to \infty$ reflects the tendency of $\lambda_{0n}$ to approach zero. A randomness measure for a TIMP with doubly stochastic PTM could be the inverse of the largest of the magnitudes of all real, non-unity eigenvalues of $P$. This, however, is impractical since $\lim_{n \to \infty} \lambda_{0n}^{-1} = \infty$. If
for  P  =  [p.y],  =  [a.y],  iJ  =  1. »  where  a.-y  =  p.y  if  p.y  =  0 

and  o.y  =  1  if  p.y  /  0,  then  a  more  attractive  measure  is 
9i  =  max,'  {A,-  |1A,'1  <  1  ,3x,'  3  Fadj^i  —  A.-x,-}  <  P  —  1-  This 
is  well  defined  if  as  conjectured,  »  increases  as  Pa*'  becomes 
less  sparse. 

III.  Example 

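For $p = 11$, $k = 3$ one implementation of the ACM yields the PS

03670369267036947123921458581470503690925818149258369
47036920379258147069258258581470381473614702570364692
58136925814147.

The (doubly stochastic) PTM is

$$P = \frac{1}{12}\begin{bmatrix}
0 & 0 & 1 & 8 & 0 & 1 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 9 & 0 & 0 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 0 & 8 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 9 & 1 & 1 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 8 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 10 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 2 & 0 & 8 \\
9 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 8 & 1 & 1 & 0 & 2 & 0 & 0 & 0 & 0 \\
1 & 0 & 9 & 0 & 2 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$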

and $\Re = 4.07$. Interleaving every 30th entry until all entries are exhausted yields a PS with $P_{adj}$ less sparse and $\Re = 6.48$.
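The double stochasticity claimed for the example PTM can be checked mechanically. The Python sketch below stores the transition counts as read from the printed matrix (each entry divided by 12), verifies that all row and column sums equal one, and also shows how such counts could be tallied from a sequence. The helper names are hypothetical, and the cyclic wrap-around in `transition_counts` is an assumption suggested by the row sums of 12 over a period of 120 symbols.

```python
from fractions import Fraction

# Transition counts as read from the printed PTM (entries are counts / 12).
COUNTS = [
    [0, 0, 1, 8, 0, 1, 1, 0, 0, 1],
    [0, 0, 1, 1, 9, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 8, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 9, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 1, 8, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 1, 10, 0],
    [0, 1, 0, 0, 1, 0, 0, 2, 0, 8],
    [9, 1, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 8, 1, 1, 0, 2, 0, 0, 0, 0],
    [1, 0, 9, 0, 2, 0, 0, 0, 0, 0],
]

def is_doubly_stochastic(matrix):
    """True if all entries are nonnegative and every row and column sums to 1."""
    nonneg = all(x >= 0 for row in matrix for x in row)
    rows_ok = all(sum(row) == 1 for row in matrix)
    cols_ok = all(sum(col) == 1 for col in zip(*matrix))
    return nonneg and rows_ok and cols_ok

def transition_counts(seq, n_states):
    """Count one-step transitions seq[i] -> seq[i+1], wrapping at the end."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(seq, seq[1:] + seq[:1]):
        counts[a][b] += 1
    return counts

# Exact rational arithmetic avoids floating-point issues in the sum checks.
P = [[Fraction(c, 12) for c in row] for row in COUNTS]

if __name__ == "__main__":
    print(is_doubly_stochastic(P))
```

Using exact fractions keeps the equality tests on the row and column sums reliable.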
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Abstract  —  Probability  models  are  estimated  by  use 
of  penalized  likelihood  criteria  related  to  AIC  and 
MDL.  The  asymptotic  risk  of  the  density  estimator 
is  determined,  under  conditions  on  the  penalty  term, 
and is shown to be minimax optimal. As an application, we show that the optimal rate of convergence is achieved for density estimation in certain smooth nonparametric families without knowing the smoothness parameters in advance.

I. Introduction

Both AIC [1] and MDL [2] are widely used model selection criteria based on information-theoretic considerations. Recent work described in the talk by Barron at this workshop suggests that in certain cases the minimum description length principle can yield a minimax optimal criterion of the form $-\log(\text{likelihood}) + \text{const} \cdot m$ as opposed to $-\log(\text{likelihood}) + \frac{m}{2} \log(n)$, where $m$ is the number of parameters in the model and $n$ is the sample size. The penalty term in this criterion is of the same order as that in AIC, which takes the form $-\log(\text{likelihood}) + m$. Previously, an asymptotic optimality property was obtained for AIC applied to a sequence of linear models in estimating a nonparametric regression function with fixed design [3][4]. In this work, we consider criteria of the form

$$-\sum_{i=1}^{n} \log f_k(X_i, \hat{\theta}_k) + \lambda_k m_k$$

where $\lambda_k$ is a positive constant and $\hat{\theta}_k$ is the maximum likelihood estimator of $\theta_k$ in model $k$. In this work $\lambda_k$ is specified so that the desired asymptotic results hold. Here $X_1, \ldots, X_n$ are an i.i.d. sample from an unknown density $f(x)$ w.r.t. some $\sigma$-finite measure.
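A criterion of this penalized form can be illustrated with a toy model family. The Python sketch below selects the number of bins of a histogram density on [0, 1] by minimizing the negative maximized log-likelihood plus a penalty proportional to the number of free parameters. The histogram family, the constant penalty weight `lam`, and the function names are assumptions made for illustration; the paper itself works with general parametric families and a model-dependent $\lambda_k$.

```python
import math

def neg_log_likelihood(data, k):
    """Negative maximized log-likelihood of a k-bin histogram density on [0, 1]."""
    n = len(data)
    counts = [0] * k
    for x in data:
        counts[min(int(x * k), k - 1)] += 1
    # The ML density on a bin holding c of the n points is c * k / n.
    return -sum(c * math.log(c * k / n) for c in counts if c > 0)

def select_bins(data, candidates, lam):
    """Pick k minimizing  -log-likelihood + lam * m_k,  with m_k = k - 1."""
    return min(candidates,
               key=lambda k: neg_log_likelihood(data, k) + lam * (k - 1))

if __name__ == "__main__":
    flat = [(i + 0.5) / 100 for i in range(100)]      # evenly spread sample
    peaked = [(i + 0.5) / 1000 for i in range(100)]   # sample confined to [0, 0.1)
    print(select_bins(flat, range(1, 11), lam=1.0))
    print(select_bins(peaked, range(1, 11), lam=1.0))
```

On the evenly spread sample the penalty dominates and the one-bin model is kept, while the concentrated sample justifies the finest partition on offer.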

To also handle selection problems involving large numbers of models of each dimension $m_k$, we consider criteria of the form

$$-\sum_{i=1}^{n} \log f_k(X_i, \hat{\theta}_k) + \lambda_k m_k + C_k \qquad (*)$$

where $C_k$ is a model complexity satisfying Kraft's inequality $\sum_k 2^{-C_k} \le 1$. We note, however, that $\lambda_k m_k$ does not necessarily correspond to a description of estimated parameters, so (*) does not necessarily have a total description length interpretation, and the work of Barron and Cover [5] therefore does not apply.

We  evaluate  the  new  criteria  by  compeiring  the  Hellinger 
loss  d^nift/kg-)  with  an  index  of  resolvability.  The  concept 
of  resolvability  was  introduced  in  [5].  It  naturally  captures  the 
capability  of  estimating  an  unknown  function  by  a  sequence 
of  models.  The  index  of  resolvability  can  be  defined  as 

RM)  =  mf{^  mf  ^  dUf,  fk,g,)  +  ^  +  ^}  . 

The first term $\inf_{\theta_k \in \Theta_k} d_H^2(f, f_{k,\theta_k})$
reflects the approximation capability of model k to the true
function f(x), the second term $\lambda_k m_k / n$ reflects the
variation due to estimating the best parameters in the model, and
the last term $C_k / n$ reflects the complexity of the model relative
to the sample size. The index of resolvability quantifies the best
tradeoff among the approximation error, the estimation error and the
model complexity.
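
To make the tradeoff concrete, the following sketch evaluates an index of this form over a finite list of candidate models. The candidate list is a toy assumption, not taken from the paper: model m has approximation error m^(-2s), constant 1 in the dimension penalty, and complexity C = m (which satisfies Kraft's inequality).

```python
def resolvability(candidates, n):
    """Return (value, index) minimizing approx + lam * m / n + c / n.

    `candidates` is a list of (approx_error, lam, m, c) tuples, one per model.
    """
    values = [a + lam * m / n + c / n for (a, lam, m, c) in candidates]
    best = min(range(len(values)), key=values.__getitem__)
    return values[best], best


def toy_candidates(s, max_m):
    """Hypothetical model list: approximation error m**(-2s), lam = 1, C = m."""
    return [(m ** (-2.0 * s), 1.0, m, float(m)) for m in range(1, max_m + 1)]


for n in (1000, 8000):
    value, idx = resolvability(toy_candidates(s=1, max_m=200), n)
    # idx + 1 is the dimension m of the model attaining the best tradeoff.
    print(n, idx + 1, value)
```

In this toy setting the minimizing dimension grows with the sample size while the index itself shrinks, which is the behavior the three terms are meant to balance.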

II.  Main  results 

It is shown in this work that, with the new criteria, under
some reasonable smoothness conditions on the parametric families
and some restriction on $\lambda_k$, the Hellinger loss
$d_H^2(f, f_{\hat{k},\hat{\theta}_{\hat{k}}})$ is bounded in probability
by the index of resolvability $R_n(f)$. With some additional
conditions, the risk $E\, d_H^2(f, f_{\hat{k},\hat{\theta}_{\hat{k}}})$ is
proved to be bounded by a multiple of the index of resolvability
$R_n(f)$, i.e.,

$$E\, d_H^2(f, f_{\hat{k},\hat{\theta}_{\hat{k}}}) = O(R_n(f)).$$

As a consequence, by examining the index of resolvability
for various nonparametric classes of functions, the convergence
rates of the modified AIC procedure can be easily upper-bounded.
For some cases, the optimal rate of convergence is shown to be
achieved.

III. An Application

As an application, we consider estimating a density function
on [0,1] using a sequence of exponential families with spline
basis functions. The logarithm of the density is assumed to be
in the Sobolev space $W_2^s$ (which consists of all the functions on
[0,1] having s square-integrable derivatives) with s unknown.
The new criterion is used to select the spline order and the
number of knots. For each s and each number of knots, separate
spline models are considered for each radius constraint
$\|\log f\|_\infty \le r$, r = 1, 2, ... . The corresponding $\lambda_k$
depends on r and s. We conclude that the optimal rate of convergence
is achieved simultaneously for density functions f with $f \in W_2^s$
for all s, without knowing s in advance.

ABSTRACT

A wavelet-based neural network is described. The network
is similar to the radial basis function (RBF) network,
except that the RBFs are replaced by orthonormal scaling
functions. It has been shown that the wavelet network has
universal and $L^2$ approximation properties and is a consistent
function estimator. Convergence rates, which avoid the
“curse of dimensionality,” are obtained for certain function
classes. The network also compared favorably to the MLP
and RBF networks in the experiments.

1.  INTRODUCTION 

Recently, neural networks have become a popular tool in
non-parametric function learning. While the multi-layer
perceptron (MLP) is probably the most frequently used,
its training process often converges too slowly or settles in
undesirable local minima. The radial basis function (RBF)
network can be trained more easily provided that certain
network parameters (e.g., the centers and the variances)
are properly preset. As a function representation scheme,
the RBF network uses a family of locally-supported basis
functions, which allows it to represent a “rich” class of
functions. However, since the basis functions are generally
non-orthogonal, the RBF representation is not unique
(coefficients “harder to learn”) and not the most efficient.

In this work, we replace the basis functions in the RBF
network by an orthonormal basis, namely, the scaling functions
associated with an orthonormal wavelet basis. For a
given function, this “wavelet network” provides a unique
(coefficients “easy to learn”) and efficient representation.
The use of orthonormal scaling functions also facilitates the
theoretical analysis of the network, such as universal
approximation and consistency. The idea of using
wavelets in neural networks has also been investigated
recently by Zhang and Benveniste and by Pati and Krishnaprasad
(see the reference list of [1]), who use non-orthogonal wavelets,
and by Boubez (see also [1]), who is concerned more with
classification than with function learning.

2.  WAVELET  NETWORKS 

In  this  section,  we  briefly  summarize  the  main  theoretical 
and  experimental  results  related  to  the  wavelet  network. 
More  details  can  be  found  in  [1].  For  the  sake  of  simplicity, 
we  first  look  at  the  1-D  case  (one  input  and  one  output). 


Since in most practical applications the function of interest
has finite support, we assume, without loss of
generality, that $f(t) \in L^2(\mathbb{R})$ and has finite support. Let
g(t) be a wavelet-based approximation to f(t). Then, there
exists a sufficiently large M, such that

$$f(t) \approx g(t) = \sum_k c_k \varphi(2^M t - k), \qquad (1)$$

where $\varphi(t)$ is a compactly-supported (or fast-decaying) scaling
function and k runs through a finite set of integers. g(t)
can be implemented as a three-layer network [1] and the $c_k$'s
can be estimated by minimizing the mean square error between
f(t) and g(t) over a training data set. Since multidimensional
scaling functions can be obtained easily from $\varphi(t)$,
the extension of the network to dimensions higher than
one is straightforward.
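
As a rough illustration of (1), the sketch below fits the coefficients by least squares for the simplest compactly-supported scaling function, the Haar function (the indicator of [0, 1)). The choice of Haar, the target function t^2, and all function names are assumptions made for this example; the networks in [1] may use smoother scaling functions. Because the translates of the Haar function at a fixed scale have disjoint supports, the least-squares problem decouples and each coefficient is the mean of the training targets falling in its cell.

```python
def haar_phi(t):
    """Haar scaling function: indicator of the interval [0, 1)."""
    return 1.0 if 0.0 <= t < 1.0 else 0.0


def fit_coefficients(samples, M, ks):
    """Least-squares c_k for g(t) = sum_k c_k * phi(2**M * t - k).

    `samples` is a list of (t, y) training pairs. With Haar phi the
    translates do not overlap, so c_k is the mean of y over the samples
    that fall in the support of the k-th translate.
    """
    coeffs = {}
    for k in ks:
        ys = [y for (t, y) in samples if haar_phi(2 ** M * t - k) == 1.0]
        coeffs[k] = sum(ys) / len(ys) if ys else 0.0
    return coeffs


def evaluate(coeffs, M, t):
    """Network output g(t) for the fitted coefficients."""
    return sum(c * haar_phi(2 ** M * t - k) for k, c in coeffs.items())


def max_error(f, M, num_samples=256):
    """Largest training error of the scale-M Haar network on [0, 1)."""
    samples = [(i / num_samples, f(i / num_samples)) for i in range(num_samples)]
    coeffs = fit_coefficients(samples, M, range(2 ** M))
    return max(abs(evaluate(coeffs, M, t) - y) for t, y in samples)


err_coarse = max_error(lambda t: t * t, M=3)
err_fine = max_error(lambda t: t * t, M=5)
```

Increasing M refines the cells, so `err_fine` comes out smaller than `err_coarse` for this smooth target.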

The  theoretical  results  related  to  the  wavelet  network 
are  described  by  the  following  three  theorems: 

Theorem 1. The wavelet network has the properties
of universal approximation and $L^2$ approximation.

Theorem 2. The rates of convergence for Theorem 1
can be made arbitrarily rapid in the following sense: for
any $\alpha > 0$, there is a Sobolev space $H^\beta$ such that for any
$f \in H^\beta$, there exists a sequence of wavelet networks $f_n$,
where $n = 2^M$, such that

$$\|f - f_n\|_u = O(n^{-\alpha}), \qquad \|f - f_n\|_{L^2} = O(n^{-\alpha}). \qquad (2)$$

Here $\|\cdot\|_u$ and $\|\cdot\|_{L^2}$ are the sup and $L^2$ norms, respectively.

Theorem 3. Assume that the training data are i.i.d.
and uniformly distributed. Then, the wavelet network is
consistent in the mean square sense and the rate of convergence
for the coefficients is O(1/N), where N is the size of
the training data set.

The proofs of these theorems can be found in [1]. In the
experiments, the wavelet network performed better than
the MLP with similar complexity in learning discontinuous
functions, and the performance of the RBF network became
comparable to that of the wavelet network only when some of its
parameters were preset according to the wavelet network.

Abstract - Lempel-Ziv-Welch methods and their
variations are all based on the principle of using a
prescribed parsing rule to find duplicate occurrences of
data and encoding the repeated strings with some sort of
special code word identifying the data to be replaced.
This paper includes a general presentation of five
existing lossless compression methods used in
applications of digital signal processing. The comparisons
are made experimentally by computer simulation.


I.  Introduction 

The purpose of this paper is to compare the compression
performance of five lossless compression algorithms
applied to various types of files.

II.  Presentation  of  Algorithms 

Algorithm 1 [1], [3] is the Lempel-Ziv 1977 method,
Algorithm 2 [2] is the Lempel-Ziv 1978 method, Algorithm
3 [1], [5] is an intermediate 1992 method of Algorithms 1
and 2, Algorithm 4 [5] is the Welch variant of Algorithm 3
and Algorithm 5 [5] is an intermediate method of the LZW
method [4] and Algorithm 4. Note that the LZW method is
a practical Welch type variant of Algorithm 2.
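
For orientation, a minimal sketch of the textbook LZW scheme is given below. It is not the implementation used in the experiments, and Algorithms 3 to 5 are not reproduced here. The fixed code-word width `code_bits`, the definition of the ratio as compressed size over original size, and the convention of counting the 256 single-byte entries in the string table size are all assumptions of this example.

```python
def lzw_compress(data: bytes):
    """Return (codes, table_size) for textbook LZW over bytes."""
    table = {bytes([i]): i for i in range(256)}  # initial single-byte strings
    w = b""
    codes = []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc  # keep extending the current match
        else:
            codes.append(table[w])    # emit code for the longest match
            table[wc] = len(table)    # add the new string to the table
            w = bytes([b])
    if w:
        codes.append(table[w])
    return codes, len(table)


def lzw_decompress(codes):
    """Invert lzw_compress, rebuilding the string table on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            entry = w + w[:1]  # code refers to the string being defined
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)


def compression_ratio(data: bytes, code_bits: int = 12) -> float:
    """Compressed size over original size, assuming fixed-width code words."""
    codes, _ = lzw_compress(data)
    return (len(codes) * code_bits) / (8 * len(data))


codes, table_size = lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")
```

On very short inputs such as this one the table has little to exploit, so the ratio stays near 1; repeated strings only start to pay off as the input grows.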

III.  Simulation  Results  and  Remarks 

In order to investigate the performance of the practical
schemes, the proposed algorithms have been implemented
by experimental computer programs, which were tested
against various kinds of byte-oriented data. In addition to
the compression ratio R(n), the size S(n) of the
corresponding string table is shown in the table as a
function of the input length n.


n [bytes]           100    200    500   1000   2000   5000  10000

Alg. 4      R(n)    .72    .65    .56    .50    .41    .35    .32
            S(n)    355    451    748   1250   2247   5252  10254

Alg. 5      R(n)    .75    .67    .58    .52    .46    .39    .35
            S(n)    321    384    625    809   1627   3074   5103

LZW meth.   R(n)    .80    .72    .64    .56    .51    .42    .38
            S(n)    298    370    420    701    996   1675   2982

In order to study the asymptotic convergence of the
compression ratio, we used: a Turbo Pascal file of length
6250 bytes, a maximum length of 32 bytes for the source
words, a length of 3 bytes for the code words and various
values for the encoder buffer length. The results are:


nb [bytes]           100    200    400    800   1600

Algorithm 1   R(n)   .76    .57    .42    .37    .33

Algorithm 4   R(n) = .42
Algorithm 5   R(n) = .45
LZW meth.     R(n) = .51

The algorithms have also been tested on different types of
program and text files. The values obtained for the
compression ratio R(n) are shown in the following:


File   Type     Size [bytes]   Alg. 4   Alg. 5   LZW

# 1    Pascal       2993         .50      .55    .60
# 2    Text         1237         .77      .81    .87
# 3    Pascal       6250         .42      .45    .51
# 4    Pascal      10423         .32      .35    .38

The best compression ratio is given by Algorithm 4. In
general, Algorithm 4 shows between 10 and 16 percent
improvement over the LZW method, and Algorithm 5
shows between 7 and 11 percent improvement over LZW.
Good values for the compression ratio are obtained only for
long input sequences. For short files with small entropy,
Algorithm 1 is the best. Also, regarding the memory space
for encoding, Algorithm 1 has the best performance,
because the other ones require greater memory space for
developing the string tables.